A probability space consists of S and P, where S is a sample space and P is a function that takes an event $A\subseteq S$ as input and returns $P(A)\in[0,1]$ as output, such that:
$P(\emptyset)=0,\quad P(S)=1$
$P\left(\bigcup_{n=1}^{\infty}A_n\right)=\sum_{n=1}^{\infty}P(A_n)$ if $A_1,A_2,\dots$ are disjoint (non-overlapping)
Example: De Montmort's Problem (1713), the matching problem
n cards labeled $1,2,\dots,n$; let $A_j$ be the event "the $j$th card matches":
$$P(A_j)=\frac{(n-1)!}{n!}=\frac{1}{n},\qquad P(A_1\cap A_2)=\frac{(n-2)!}{n!}=\frac{1}{n(n-1)},\qquad\dots,\qquad P(A_1\cap A_2\cap\dots\cap A_k)=\frac{(n-k)!}{n!}$$
By inclusion-exclusion:
$$P\left(\bigcup_{j=1}^{n}A_j\right)=n\cdot\frac{1}{n}-\binom{n}{2}\frac{(n-2)!}{n!}+\binom{n}{3}\frac{(n-3)!}{n!}-\dots+(-1)^{n+1}\cdot\frac{1}{n!}=1-\frac{1}{2!}+\frac{1}{3!}-\dots+(-1)^{n+1}\cdot\frac{1}{n!}\approx 1-e^{-1}$$
(using $e^x=\sum_{n=0}^{\infty}\frac{x^n}{n!}$)
$$P(\text{no match})=P\left(\bigcap_{j=1}^{n}A_j^c\right)=1-1+\frac{1}{2!}-\frac{1}{3!}+\dots+(-1)^n\frac{1}{n!}\approx e^{-1}$$
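A Monte Carlo sketch of this limit (the choice n = 10 and the trial count are illustrative, not from the notes):

```python
import random
from math import e

# Estimate the probability of at least one match in De Montmort's problem;
# it should be close to 1 - 1/e for moderate n.
def prob_at_least_one_match(n, trials, seed=0):
    rng = random.Random(seed)
    deck = list(range(n))
    hits = 0
    for _ in range(trials):
        rng.shuffle(deck)
        if any(card == pos for pos, card in enumerate(deck)):
            hits += 1
    return hits / trials

estimate = prob_at_least_one_match(n=10, trials=100_000)
print(estimate, 1 - 1/e)  # the two values should be close
```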
1 door has car, 2 doors have goats, Monty knows which, he always opens a goat door if he has a choice, he picks with equal probability. Should you switch?
$$P(\text{success if switch}\mid\text{Monty opens door 2})=\frac{\frac{1}{3}}{\frac{1}{3}\times\frac{1}{2}+\frac{1}{3}}=\frac{2}{3}$$
graph LR
A(initial pick door 1) --1/3--> B(car door is 1)
A(initial pick door 1) --1/3--> C(car door is 2)
A(initial pick door 1) --1/3--> D(car door is 3)
B(car door is 1) --1/2--> E(Monty opens door 2)
B(car door is 1) --1/2--> F(Monty opens door 3)
C(car door is 2) --1--> G(Monty opens door 3)
D(car door is 3) --1--> H(Monty opens door 2)
linkStyle 0 stroke:#ff0000,stroke-width:3px
linkStyle 3 stroke:#ff0000,stroke-width:3px
linkStyle 2 stroke:#ff0000,stroke-width:3px
linkStyle 6 stroke:#ff0000,stroke-width:3px
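The tree above can be checked by simulation; a minimal sketch (door labels and the trial count are illustrative choices):

```python
import random

# Monty knows the car door, always opens a goat door, and picks uniformly
# when he has a choice. Switching should win about 2/3 of the time.
def switch_win_rate(trials, seed=1):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = 0  # by symmetry, always picking door 0 loses no generality
        goats = [d for d in range(3) if d != pick and d != car]
        monty = rng.choice(goats)  # the goat door Monty opens
        switched = next(d for d in range(3) if d not in (pick, monty))
        wins += (switched == car)
    return wins / trials

rate = switch_win_rate(100_000)
print(rate)  # close to 2/3
```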
1.4 Total Probability Theorem & Bayes' Rule
Total Probability Theorem: let $A_1,A_2,\dots,A_n$ be disjoint events that form a partition of the sample space (every possible outcome makes exactly one of them occur). Then
$$P(B)=P(A_1\cap B)+\dots+P(A_n\cap B)=P(A_1)P(B\mid A_1)+\dots+P(A_n)P(B\mid A_n)$$
The probability of any event B is a weighted average of its conditional probabilities given the $A_i$, with weights equal to the unconditional probabilities $P(A_i)$
Example: a patient gets tested for a disease that afflicts 1% of the population; suppose the test is advertised as "95% accurate". Find the probability that the patient has the disease given that the test is positive.
Let D: patient has disease, T: patient tests positive
$$P(D\mid T)=\frac{P(T\mid D)P(D)}{P(T)}=\frac{P(T\mid D)P(D)}{P(T\mid D)P(D)+P(T\mid D^c)P(D^c)}=\frac{0.95\times 0.01}{0.95\times 0.01+0.05\times 0.99}\approx 0.16$$
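The same computation in code, with the numbers from the example:

```python
# Bayes' rule applied to the testing example above.
p_d = 0.01      # prior: disease prevalence, P(D)
p_t_d = 0.95    # sensitivity, P(T | D)
p_t_dc = 0.05   # false positive rate, P(T | D^c)

p_t = p_t_d * p_d + p_t_dc * (1 - p_d)  # total probability P(T)
p_d_t = p_t_d * p_d / p_t               # posterior P(D | T)
print(round(p_d_t, 3))
```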
"Biohazards"
Confusing P(A|B) with P(B|A), "prosecutor's fallacy", Sally Clark case
confusing P(A) "prior" with P(A|B) "posterior"
confusing independence with conditional independence
1.5 Independence
Events A, B are independent if $P(A\cap B)=P(A)P(B)$; equivalently, $P(A\mid B)=P(A)$ when $P(B)>0$
If A and B are independent, then A and $B^c$ are also independent
1.5.1 Conditional Independence
Events A,B are conditionally independent given C if P(A∩B∣C)=P(A∣C)P(B∣C)
Also, $P(A\cap B\mid C)=\frac{P(A\cap B\cap C)}{P(C)}=\frac{P(C)P(B\mid C)P(A\mid B\cap C)}{P(C)}=P(B\mid C)P(A\mid B\cap C)\implies P(A\mid B\cap C)=P(A\mid C)$, given $P(B\mid C)\neq 0$. Given that C occurs, the additional assumption that B also occurs does not change the conditional probability of A
Independence of A and B does not imply conditional independence given C, and vice versa:
e.g. let A, B be independent and let C satisfy $P(C)>0$, $P(A\mid C)>0$, $P(B\mid C)>0$, and $A\cap B\cap C=\emptyset$. Since $P(A\cap B\mid C)=0$ while $P(A\mid C)P(B\mid C)>0$, A and B cannot be conditionally independent given C. Conversely, consider $H_1=\{\text{first toss is heads}\}$, $H_2=\{\text{second toss is heads}\}$, $D=\{\text{the two tosses give different results}\}$: $H_1$ and $H_2$ are independent but not conditionally independent given D
1.5.2 Independence of a series of events
$A_1,A_2,\dots,A_n$ are mutually independent if $P\left(\bigcap_{i\in S}A_i\right)=\prod_{i\in S}P(A_i)$ for any subset S of $\{1,2,\dots,n\}$
e.g. Events A1,A2,A3 are independent if
$P(A_1\cap A_2)=P(A_1)P(A_2)$
$P(A_1\cap A_3)=P(A_1)P(A_3)$
$P(A_2\cap A_3)=P(A_2)P(A_3)$
$P(A_1\cap A_2\cap A_3)=P(A_1)P(A_2)P(A_3)$
Pairwise independence of three events does not imply that the three events are mutually independent
Gambler's Ruin: a gambler makes a sequence of independent bets, winning $1 with probability p and losing $1 with probability $q=1-p$ on each bet. He starts with k dollars and stops betting when he goes broke or when his total reaches n dollars (the opponent goes broke). What is the probability that his total reaches n dollars?
Strategy: condition on the first step (first-step analysis)
Let A = {total reaches n dollars}, $w_k$ = P(A | start with k dollars). By the law of total probability,
$$w_k=pw_{k+1}+qw_{k-1},\quad 0<k<n,\qquad w_0=0,\ w_n=1$$
Let $r=q/p$; then
$$w_{k+1}-w_k=r(w_k-w_{k-1})=r^k(w_1-w_0)=r^k w_1$$
$$w_{k+1}=w_k+r^k w_1=w_{k-1}+r^{k-1}w_1+r^k w_1=w_1+rw_1+\dots+r^k w_1$$
$$w_k=\begin{cases}\dfrac{1-r^k}{1-r}w_1,&\text{if }p\neq q\\[4pt] kw_1,&\text{if }p=q\end{cases}\qquad\because w_n=1,\ w_1=\begin{cases}\dfrac{1-r}{1-r^n},&\text{if }p\neq q\\[4pt]\dfrac{1}{n},&\text{if }p=q\end{cases}\qquad\therefore w_k=\begin{cases}\dfrac{1-r^k}{1-r^n},&\text{if }p\neq q\\[4pt]\dfrac{k}{n},&\text{if }p=q\end{cases}$$
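A sketch of the closed form above, checked against the first-step recursion (p = 0.6, n = 10 are illustrative choices):

```python
# Closed-form win probability from the derivation above, verified against
# the recursion w_k = p*w_{k+1} + q*w_{k-1} with w_0 = 0, w_n = 1.
def win_prob(k, n, p):
    q = 1 - p
    if p == q:
        return k / n
    r = q / p
    return (1 - r**k) / (1 - r**n)

p, n = 0.6, 10
w = [win_prob(k, n, p) for k in range(n + 1)]
assert abs(w[0]) < 1e-12 and abs(w[n] - 1) < 1e-12
for k in range(1, n):
    assert abs(w[k] - (p * w[k + 1] + (1 - p) * w[k - 1])) < 1e-12
print(w[3])
```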
1.5.4 Independent Experiment & Binomial Probability
$$p(k)=\binom{n}{k}p^k(1-p)^{n-k}$$
$$\sum_{k=0}^{n}\binom{n}{k}p^k(1-p)^{n-k}=1$$
When p = 1/2, this is equivalent to counting the subsets of an n-element set: $\sum_{k=0}^{n}\binom{n}{k}=2^n$
1.6 Counting
1.6.1 Multiplication Rule
If an experiment has $n_1$ possible outcomes, and for each outcome of the 1st experiment there are $n_2$ outcomes for the 2nd experiment, …, and for each of these there are $n_r$ outcomes for the rth experiment, then the overall number of possible outcomes is $n_1 n_2\cdots n_r$
e.g. the number of subsets of the n-element set $\{s_1,s_2,\dots,s_n\}$ is $2^n$
1.6.2 Permutation
1.6.3 Combination
Sampling table: choose k objects out of n

| | order matters | order doesn't matter |
| --- | --- | --- |
| with replacement | $n^k$ | $\binom{n+k-1}{k}$ |
| without replacement | $\frac{n!}{(n-k)!}$ | $\binom{n}{k}$ |
Properties:
$\binom{n}{k}=\binom{n}{n-k}$
$n\binom{n-1}{k-1}=k\binom{n}{k}$
pick k people out of n, with 1 designated as "President":
pick "President" and then pick the rest = pick k people and choose one to be "President"
Hypergeometric $X\sim\text{HGeom}(w,b,n)$
X is #white balls in n draws without replacement from a population of size $N=w+b$ that contains w white balls and b black balls
$$p_X(k)=\frac{\binom{w}{k}\binom{b}{n-k}}{\binom{w+b}{n}},\quad 0\le k\le w$$
$$E[X]=\frac{nw}{w+b},\qquad\operatorname{var}(X)=\frac{N-n}{N-1}\cdot n\cdot\frac{w}{N}\cdot\frac{b}{N}$$
Geometric X∼Geo(p)
independent Bern(p) trials; keep failing until the first success at the kth trial
$$p_X(k)=(1-p)^{k-1}p,\quad k=1,2,\dots$$
$$E[X]=\frac{1}{p},\qquad\operatorname{var}(X)=\frac{1-p}{p^2}$$
$$M_X(s)=\frac{pe^s}{1-(1-p)e^s}$$
Poisson X∼Pois(λ)
Bin(n,p) converges to Pois(λ) as $n\to\infty$, $p\to 0$, with $\lambda=np$ held constant
$$p_X(k)=e^{-\lambda}\frac{\lambda^k}{k!},\quad k=0,1,2,\dots$$
$$E[X]=\lambda,\qquad\operatorname{var}(X)=\lambda$$
$$M_X(s)=e^{\lambda(e^s-1)}$$
Negative Binomial X∼NB(r,p)
X is #failures before rth success
$$p_X(k)=\binom{k+r-1}{k}p^k(1-p)^r,\quad k=0,1,\dots$$
$$E[X]=\frac{rp}{1-p},\qquad\operatorname{var}(X)=\frac{rp}{(1-p)^2}$$
$$M_X(s)=\left(\frac{1-p}{1-pe^s}\right)^r$$
Properties of Multinomial Distribution
(a) Its marginal distributions are Binomial: if $X\sim\text{Mult}(n,\mathbf{p})$, then $X_j\sim\text{Bin}(n,p_j)$
(b) Lumping property: for example, let $X=(X_1,X_2,\dots,X_{10})\sim\text{Mult}(n,(p_1,\dots,p_{10}))$ and $Y=(X_1,X_2,X_3+\dots+X_{10})$. Then $Y\sim\text{Mult}(n,(p_1,p_2,p_3+\dots+p_{10}))$
(c) Conditional joint distribution: for example, if $X\sim\text{Mult}(n,\mathbf{p})$ and $X_1=n_1$ is given, then $(X_2,\dots,X_k)\sim\text{Mult}(n-n_1,(p_2',\dots,p_k'))$, where $p_j'=\frac{p_j}{p_2+\dots+p_k}$
Proof that Bin(n,p) converges to Pois(λ) as $n\to\infty$, $p\to 0$, with constant $\lambda=np$ (i.e. $p=\lambda/n$):
$$p_X(k)=\frac{n!}{(n-k)!\,k!}p^k(1-p)^{n-k}=\frac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\frac{\lambda^k}{k!}\cdot\left(1-\frac{\lambda}{n}\right)^{n-k}$$
Keep k fixed. As $n\to\infty$: $\frac{n-k+j}{n}\to 1$ for $j=1,2,\dots,k$, $\left(1-\frac{\lambda}{n}\right)^{-k}\to 1$, and $\left(1-\frac{\lambda}{n}\right)^{n}\to e^{-\lambda}$, so $p_X(k)\to e^{-\lambda}\frac{\lambda^k}{k!}$
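A numeric check of this limit (λ = 2 and n = 10000 are illustrative choices):

```python
from math import comb, exp, factorial

# Compare the Bin(n, lambda/n) PMF with the Pois(lambda) PMF for large n;
# the pointwise gap should be tiny.
lam = 2.0

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pois_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

n = 10_000
gap = max(abs(binom_pmf(k, n, lam / n) - pois_pmf(k, lam)) for k in range(10))
print(gap)  # very small
```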
2.3 Functions of Random Variables
If $Y=g(X)$, then $p_Y(y)=\sum_{\{x\mid g(x)=y\}}p_X(x)$
2.4 Expectation, Mean, and Variance
2.4.1 Variance, Moment and Law of the Unconscious Statistician(LOTUS)
Conditional expectation of random variable X given random variable $Y=y$: $E[X\mid Y=y]=\sum_x x\,p_{X\mid Y}(x\mid y)$
$$E[X]=\sum_y p_Y(y)\,E[X\mid Y=y]\quad(3)$$
Total Expectation Theorem: (1),(2),(3) Unconditional expectation is the expectation of conditional expectations
Proof of the expectation and variance of $X\sim\text{Geo}(p)$: let $A_1=\{X=1\}=\{\text{succeed at 1st trial}\}$, $A_2=\{X>1\}=\{\text{fail at 1st trial}\}$. Then $E[X\mid A_1]=1$ and, by memorylessness, $E[X\mid A_2]=E[1+X]=1+E[X]$:
$$E[X]=P(A_1)E[X\mid A_1]+P(A_2)E[X\mid A_2]=p+(1-p)(1+E[X])\implies E[X]=\frac{1}{p}$$
$$E[X^2\mid A_1]=1,\qquad E[X^2\mid A_2]=E[(1+X)^2]=1+2E[X]+E[X^2]$$
$$E[X^2]=p+(1-p)(1+2E[X]+E[X^2])\implies E[X^2]=\frac{1+2(1-p)E[X]}{p}=\frac{2-p}{p^2}$$
$$\operatorname{var}(X)=E[X^2]-(E[X])^2=\frac{2-p}{p^2}-\frac{1}{p^2}=\frac{1-p}{p^2}$$
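A Monte Carlo check of these two formulas (p and the sample size are illustrative choices):

```python
import random

# Sample X ~ Geo(p) by counting trials until the first success, then compare
# the empirical mean and variance with 1/p and (1-p)/p^2.
p = 0.3
rng = random.Random(2)
samples = []
for _ in range(200_000):
    k = 1
    while rng.random() >= p:  # keep failing until the first success
        k += 1
    samples.append(k)

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var)  # near 1/p ≈ 3.333 and (1-p)/p^2 ≈ 7.778
```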
2.7 Independence
2.7.1 Independence between Random Variable and Event
Random variable X is independent of event A: $P(\{X=x\}\cap A)=P(X=x)P(A)=p_X(x)P(A)$ for all x; equivalently, $p_X(x)=p_{X\mid A}(x)$ for all x
2.7.2 Independence between Two Random Variables
Random variable X is independent of random variable Y: $p_{X,Y}(x,y)=p_X(x)p_Y(y)$ for all x and y; equivalently, $p_X(x)=p_{X\mid Y}(x\mid y)$ for all x and all y with $p_Y(y)>0$
Random variable X is conditionally independent of random variable Y given that event A occurred: $p_{X,Y\mid A}(x,y)=p_{X\mid A}(x)p_{Y\mid A}(y)$ for all x and y; equivalently, $p_{X\mid Y,A}(x\mid y)=p_{X\mid A}(x)$ for all x
Expectation of a product of independent random variables X and Y: $E[XY]=E[X]E[Y]$, and more generally $E[g(X)h(Y)]=E[g(X)]E[h(Y)]$
Variance of sum of independent random variables X+Y: var(X+Y)=var(X)+var(Y)
Proof
$\because\operatorname{var}(X+c)=\operatorname{var}(X)$
Shift the random variables so their expectations are 0: $\tilde{X}=X-E[X]$, $\tilde{Y}=Y-E[Y]$. Then
$$\operatorname{var}(X+Y)=\operatorname{var}(\tilde{X}+\tilde{Y})=E[(\tilde{X}+\tilde{Y})^2]-(E[\tilde{X}+\tilde{Y}])^2=E[\tilde{X}^2+2\tilde{X}\tilde{Y}+\tilde{Y}^2]=E[\tilde{X}^2]+2E[\tilde{X}]E[\tilde{Y}]+E[\tilde{Y}^2]=\operatorname{var}(\tilde{X})+\operatorname{var}(\tilde{Y})=\operatorname{var}(X)+\operatorname{var}(Y)$$
2.7.3 Independence among Multiple Random Variables
Random variables X, Y, Z are independent: $p_{X,Y,Z}(x,y,z)=p_X(x)p_Y(y)p_Z(z)$ for all x, y, z. Then $f(X),g(Y),h(Z)$ are independent; $g(X,Y),h(Z)$ are independent; but $g(X,Y),h(Y,Z)$ are usually not independent
2.7.4 Variance of Sum of Multiple Independent Variables
If $X_1,\dots,X_n$ are independent random variables: $\operatorname{var}(X_1+\dots+X_n)=\operatorname{var}(X_1)+\dots+\operatorname{var}(X_n)$
Chapter 3 General Random Variables
3.1 Continuous Random Variables
For each random variable X below, the table lists: definition, PDF $f_X(x)$, CDF $F_X(x)$, expectation $E[X]=\int_{-\infty}^{\infty}x f_X(x)\,dx$, variance $\operatorname{var}(X)=E[X^2]-(E[X])^2$, and MGF $M_X(s)=E[e^{sX}]$
Continuous Uniform X∼Unif(a,b)
An experiment where there is an arbitrary outcome that lies between certain bounds
$$f_X(x)=\begin{cases}\frac{1}{b-a},&\text{if }a\le x\le b\\0,&\text{otherwise}\end{cases}$$
$$F_X(x)=\frac{x-a}{b-a},\quad a\le x\le b$$
$$E[X]=\frac{a+b}{2},\qquad\operatorname{var}(X)=\frac{(b-a)^2}{12}$$
$$M_X(s)=\frac{1}{b-a}\cdot\frac{e^{sb}-e^{sa}}{s}$$
Beta X∼Beta(α,β),α>0,β>0
The conjugate prior probability distribution for Bern, Bin, NB, Geo in Bayesian inference; a suitable model for the random behavior of percentages and proportions
Exponential $X\sim\text{Exp}(\lambda)$
The time between events in a Poisson process (in which events occur continuously and independently at a constant average rate)
$$f_X(x)=\lambda e^{-\lambda x},\quad x\ge 0$$
$$F_X(x)=1-e^{-\lambda x},\quad x\ge 0$$
$$E[X]=\frac{1}{\lambda},\qquad\operatorname{var}(X)=\frac{1}{\lambda^2}$$
$$M_X(s)=\frac{\lambda}{\lambda-s},\quad s<\lambda$$
Laplace Z∼Laplace(0,λ−1)
Z=X−Y, where X,Y∼i.i.d.Exp(λ)
$$f_Z(z)=\frac{\lambda}{2}e^{-\lambda|z|}$$
$$F_Z(z)=\begin{cases}1-\frac{1}{2}e^{-\lambda z},&z\ge 0\\\frac{1}{2}e^{\lambda z},&z<0\end{cases}$$
$$E[Z]=0,\qquad\operatorname{var}(Z)=\frac{2}{\lambda^2}$$
Gamma X∼Gamma(k,λ)
kth arrival time in Poisson process, sum of i.i.d. Exp(λ)
$$f_X(x)=\frac{1}{\Gamma(k)}(\lambda x)^k e^{-\lambda x}\frac{1}{x},\qquad\text{where }\Gamma(k)=\int_0^\infty x^k e^{-x}\frac{dx}{x}$$
$$E[X]=\frac{k}{\lambda},\qquad\operatorname{var}(X)=\frac{k}{\lambda^2}$$
$$M_X(s)=\left(\frac{1}{1-s/\lambda}\right)^k,\quad s<\lambda$$
Normal $X\sim N(\mu,\sigma^2)$
The sum of a large number of independent and identically distributed r.v.'s has an approximately normal CDF
$$f_X(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-(x-\mu)^2/(2\sigma^2)}$$
$$F_X(x)=\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{x}e^{-(t-\mu)^2/(2\sigma^2)}\,dt$$
$$E[X]=\mu,\qquad\operatorname{var}(X)=\sigma^2$$
$$M_X(s)=e^{\mu s+\sigma^2 s^2/2}$$
Standard Normal Z∼N(0,1)
$$Z=\frac{X-\mu}{\sigma}$$
$$f_Z(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2}$$
$$\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\,dt$$
$$E[Z]=0,\qquad\operatorname{var}(Z)=1$$
$$M_Z(s)=e^{s^2/2}$$
Cauchy T∼Cauchy(0,1)
$$T=\frac{Z_1}{Z_2},\quad\text{where }Z_1,Z_2\sim\text{i.i.d. }N(0,1)$$
$$f_T(t)=\frac{1}{\pi(1+t^2)}$$
$$F_T(t)=\frac{1}{\pi}\arctan(t)+\frac{1}{2}$$
The mean and variance are undefined; the MGF does not exist
Chi-Squared V∼χ2(k)
$$V=\sum_{i=1}^{k}Z_i^2,\quad\text{where }Z_i\sim\text{i.i.d. }N(0,1);\qquad\chi^2(k)\text{ is }\text{Gamma}\left(\tfrac{k}{2},\tfrac{1}{2}\right)$$
$$f_V(v)=\frac{1}{2^{k/2}\Gamma(k/2)}v^{k/2}e^{-v/2}\frac{1}{v}$$
$$E[V]=k,\qquad\operatorname{var}(V)=2k$$
Student's t
$$T_n=\frac{Z}{\sqrt{V/n}},\quad\text{where }Z\sim N(0,1),\ V\sim\chi^2(n)\text{ are independent}$$
$$f_{T_n}(t)=\frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1+\frac{t^2}{n}\right)^{-\frac{n+1}{2}}$$
$$E[T_n]=\begin{cases}0,&n>1\\\text{undefined},&\text{otherwise}\end{cases}\qquad\operatorname{var}(T_n)=\begin{cases}\frac{n}{n-2},&n>2\\\infty,&1<n\le 2\\\text{undefined},&\text{otherwise}\end{cases}$$
The MGF is undefined
CDFs of geometric and exponential r.v.'s:
$$X\sim\text{Geo}(p):\ F_{\text{geo}}(n)=\sum_{k=1}^{n}(1-p)^{k-1}p=p\cdot\frac{1-(1-p)^n}{1-(1-p)}=1-(1-p)^n,\quad n=1,2,\dots$$
$$X\sim\text{Exp}(\lambda):\ F_{\text{exp}}(x)=\int_0^x\lambda e^{-\lambda t}\,dt=-e^{-\lambda t}\Big|_0^x=1-e^{-\lambda x},\quad x>0$$
Let $\delta=-\frac{\ln(1-p)}{\lambda}$, so that $e^{-\lambda\delta}=1-p$; then $F_{\text{exp}}(n\delta)=F_{\text{geo}}(n)$ for $n=1,2,\dots$
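A sketch verifying this matching on the grid nδ (the values of p and λ are illustrative choices):

```python
from math import exp, log

# With delta = -ln(1-p)/lambda, the Exp(lambda) CDF at n*delta should equal
# the Geo(p) CDF at n, for every positive integer n.
p, lam = 0.3, 2.0
delta = -log(1 - p) / lam
for n in range(1, 8):
    f_geo = 1 - (1 - p) ** n
    f_exp = 1 - exp(-lam * n * delta)
    assert abs(f_geo - f_exp) < 1e-12
print("CDFs agree on the grid")
```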
Let δ be a small positive number and consider the probability of a small rectangle:
$$P(a\le X\le a+\delta,\ c\le Y\le c+\delta)=\int_c^{c+\delta}\int_a^{a+\delta}f_{X,Y}(x,y)\,dx\,dy\approx f_{X,Y}(a,c)\,\delta^2$$
view $f_{X,Y}(a,c)$ as the "probability per unit area" in the vicinity of $(a,c)$
Two-dimensional uniform PDF: uniform joint PDF on a subset S of the two-dimensional plane
$$f_{X,Y}(x,y)=\begin{cases}\frac{1}{\text{area of }S},&\text{if }(x,y)\in S\\0,&\text{otherwise}\end{cases}$$
For any set $A\subset S$, the probability that $(X,Y)$ lies in A is
$$P((X,Y)\in A)=\iint_{(x,y)\in A}f_{X,Y}(x,y)\,dx\,dy=\frac{1}{\text{area of }S}\iint_{(x,y)\in A}dx\,dy=\frac{\text{area of }A}{\text{area of }S}$$
Let δ, ε be small positive numbers:
$$P(x\le X\le x+\delta\mid A)\approx f_{X\mid A}(x)\cdot\delta$$
$$P(x\le X\le x+\delta\mid y\le Y\le y+\epsilon)\approx\frac{f_{X,Y}(x,y)\cdot\delta\cdot\epsilon}{f_Y(y)\cdot\epsilon}=f_{X\mid Y}(x\mid y)\cdot\delta$$
When $\epsilon\to 0$: $P(x\le X\le x+\delta\mid Y=y)\approx f_{X\mid Y}(x\mid y)\cdot\delta$; more generally, $P(X\in A\mid Y=y)=\int_A f_{X\mid Y}(x\mid y)\,dx$
Calculate the CDF $F_Y$ of Y: $F_Y(y)=P(Y\le y)=P(g(X)\le y)=\int_{\{x\mid g(x)\le y\}}f_X(x)\,dx$
Differentiate to obtain the PDF of Y: $f_Y(y)=\frac{dF_Y}{dy}(y)$
4.1.1 The Linear Case
Let $Y=aX+b$; then $f_Y(y)=\frac{1}{|a|}f_X\left(\frac{y-b}{a}\right)$
4.1.2 The Monotonic Case
Suppose that g is strictly monotonic, so that for all x in the range of X we have $y=g(x)$ if and only if $x=g^{-1}(y)$. Assume that $h=g^{-1}$ is differentiable; then $f_Y(y)=f_X(x)\left|\frac{dx}{dy}\right|$, evaluated at $x=g^{-1}(y)$
Example: derive Log Normal from Normal
Let $Y=e^Z$, $Z\sim N(0,1)$; $Y=g(Z)$ is increasing and infinitely differentiable
$$f_Y(y)=f_Z(z)\frac{dz}{dy}=\frac{1}{y\sqrt{2\pi}}e^{-(\ln y)^2/2},\quad y>0$$
Derived distributions in $\mathbb{R}^n$: $Y=g(X)$, where $X=(X_1,\dots,X_n)$. The joint PDF of Y is
$$f_Y(y)=f_X(x)\left|\det\frac{dx}{dy}\right|,\qquad\frac{dx}{dy}=\begin{bmatrix}\frac{\partial x_1}{\partial y_1}&\cdots&\frac{\partial x_1}{\partial y_n}\\\vdots&&\vdots\\\frac{\partial x_n}{\partial y_1}&\cdots&\frac{\partial x_n}{\partial y_n}\end{bmatrix}$$
where $\left|\det\frac{dx}{dy}\right|$ is the absolute value of the determinant of the Jacobian
4.1.3 Multiple Random Variables
Laplace (two-sided exponential) PDF:
Romeo and Juliet have a date at a given time, and each, independently, will be late by an amount of time that is exponentially distributed with parameter λ. What is the PDF of the difference between their times of arrival?
Let X, Y be the amounts by which Romeo and Juliet are late. We want to find the PDF of $Z=X-Y$. We will first calculate the CDF $F_Z(z)$ by considering separately the cases $z\ge 0$ and $z<0$.
$$f_{X,Y}(x,y)=f_X(x)f_Y(y)=\lambda e^{-\lambda x}\lambda e^{-\lambda y},\quad x,y\ge 0$$
For $z\ge 0$:
$$F_Z(z)=P(Z\le z)=P(X-Y\le z)=1-P(X-Y>z)=1-\int_0^\infty\int_{z+y}^{\infty}\lambda e^{-\lambda x}\lambda e^{-\lambda y}\,dx\,dy=1-\int_0^\infty\lambda e^{-\lambda y}\left[-e^{-\lambda x}\right]_{z+y}^{\infty}\,dy=1-\int_0^\infty\lambda e^{-\lambda y}e^{-\lambda(z+y)}\,dy=1-e^{-\lambda z}\left[-\frac{1}{2}e^{-2\lambda y}\right]_0^\infty=1-\frac{1}{2}e^{-\lambda z}$$
For $z<0$: by symmetry, $Z=X-Y$ and $-Z=Y-X$ have the same distribution, so
$$F_Z(z)=P(Z\le z)=P(-Z\ge -z)=P(Z\ge -z)=1-F_Z(-z)=1-\left(1-\frac{1}{2}e^{\lambda z}\right)=\frac{1}{2}e^{\lambda z}$$
$$F_Z(z)=\begin{cases}1-\frac{1}{2}e^{-\lambda z},&z\ge 0\\\frac{1}{2}e^{\lambda z},&z<0\end{cases}\implies f_Z(z)=\begin{cases}\frac{\lambda}{2}e^{-\lambda z},&z\ge 0\\\frac{\lambda}{2}e^{\lambda z},&z<0\end{cases}=\frac{\lambda}{2}e^{-\lambda|z|}$$
Can be solved by convolution (4.1.5) as well
4.1.4 Simulation
Universality of continuous uniform r.v.'s: constructing a random variable with distribution F from a uniform r.v. Let $U\sim\text{Unif}(0,1)$ and let F be a CDF (strictly increasing and continuous). Define $X\triangleq F^{-1}(U)$; then $X\sim F$
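A sketch of this construction for F the Exp(λ) CDF, whose inverse is $F^{-1}(u)=-\ln(1-u)/\lambda$ (λ and the sample size are illustrative choices):

```python
import random
from math import log

# Inverse-CDF sampling: X = F^{-1}(U) with U ~ Unif(0,1) should have
# distribution F; here F is the Exp(lambda) CDF, so E[X] should be 1/lambda.
lam = 1.5
rng = random.Random(3)
xs = [-log(1 - rng.random()) / lam for _ in range(200_000)]
mean = sum(xs) / len(xs)
print(mean)  # near 1/lambda ≈ 0.667
```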
4.1.5 Sum of Independent Random Variables - Convolution
Sum of independent normal r.v.'s: $X\sim N(\mu_x,\sigma_x^2)$, $Y\sim N(\mu_y,\sigma_y^2)$, $Z=X+Y$:
$$f_Z(z)=\int_{-\infty}^{\infty}\frac{1}{\sigma_x\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2}\right)\frac{1}{\sigma_y\sqrt{2\pi}}\exp\left(-\frac{(z-x-\mu_y)^2}{2\sigma_y^2}\right)dx=\frac{1}{\sqrt{2\pi(\sigma_x^2+\sigma_y^2)}}\exp\left(-\frac{(z-\mu_x-\mu_y)^2}{2(\sigma_x^2+\sigma_y^2)}\right)$$
$$\implies Z\sim N(\mu_x+\mu_y,\ \sigma_x^2+\sigma_y^2)$$
The sum of finitely many independent normals is normal
Assume zero means and unit variances, so that $\rho(X,Y)=E[XY]$. Then
$$E[(X-\rho Y)^2]=E[X^2]-2\rho E[XY]+\rho^2 E[Y^2]=1-2\rho^2+\rho^2=1-\rho^2\ge 0$$
$|\rho|=1$ if and only if $Y-E[Y]=c(X-E[X])$ for some constant c
$$\rho(aX+b,Y)=\frac{a\cdot\operatorname{cov}(X,Y)}{|a|\,\sigma_X\sigma_Y}=\operatorname{sign}(a)\cdot\rho(X,Y)$$
Correlation often reflects underlying, common, hidden factor
Example
Example of $\rho=0.5$: assume Z, V, W are independent with zero means and unit variances, and let $X=Z+V$, $Y=Z+W$:
$$\operatorname{var}(X)=\operatorname{var}(Z)+\operatorname{var}(V)=2\implies\sigma_X=\sqrt{2};\quad\text{similarly }\sigma_Y=\sqrt{2}$$
$$\operatorname{cov}(X,Y)=E[(Z+V)(Z+W)]=E[Z^2]+E[ZW]+E[VZ]+E[VW]=1$$
$$\rho(X,Y)=\frac{1}{\sqrt{2}\sqrt{2}}=\frac{1}{2}$$
Variance of a sum of random variables:
$$\operatorname{var}(X_1+X_2)=\operatorname{var}(X_1)+\operatorname{var}(X_2)+2\operatorname{cov}(X_1,X_2)$$
$$\operatorname{var}\left(\sum_{i=1}^{n}X_i\right)=\sum_{i=1}^{n}\operatorname{var}(X_i)+\underbrace{\sum_{\{(i,j)\mid i\neq j\}}\operatorname{cov}(X_i,X_j)}_{n(n-1)\text{ terms}}$$
If Y is an observation that provides information about X, we view the conditional expectation as an estimator of X given Y: X^=E[X∣Y]
The estimation error: X~=X^−X
is a random variable satisfying $E[\tilde{X}\mid Y]=E[\hat{X}-X\mid Y]=E[\hat{X}\mid Y]-E[X\mid Y]=\hat{X}-\hat{X}=0$
By the law of iterated expectations, the estimation error has no systematic upward or downward bias: $E[\tilde{X}]=E[E[\tilde{X}\mid Y]]=0$
The estimator $\hat{X}$ is uncorrelated with the estimation error $\tilde{X}$:
$$E[\hat{X}\tilde{X}]=E[E[\hat{X}\tilde{X}\mid Y]]=E[\hat{X}E[\tilde{X}\mid Y]]=0,\qquad\operatorname{cov}(\hat{X},\tilde{X})=E[\hat{X}\tilde{X}]-E[\hat{X}]E[\tilde{X}]=0$$
Law of Total Variance: $\operatorname{var}(X)=\operatorname{var}(\hat{X})+\operatorname{var}(\tilde{X})=\operatorname{var}(E[X\mid Y])+E[\operatorname{var}(X\mid Y)]$
4.4 Transforms (Moment-Generating Function)
The transform associated with a r.v. X is a function $M_X(s)$ (the associated MGF) of a scalar parameter s:
$$M_X(s)=E[e^{sX}]=\begin{cases}\sum_x e^{sx}p_X(x),&\text{if }X\text{ is discrete}\\\int_{-\infty}^{\infty}e^{sx}f_X(x)\,dx,&\text{if }X\text{ is continuous}\end{cases}$$
Why does the MGF work? Recall $e^x=1+x+\frac{x^2}{2!}+\frac{x^3}{3!}+\cdots$, so $e^{sx}=1+sx+\frac{s^2x^2}{2!}+\frac{s^3x^3}{3!}+\cdots$
$$\implies E[e^{sX}]=1+sE[X]+\frac{s^2}{2!}E[X^2]+\frac{s^3}{3!}E[X^3]+\cdots$$
$$\implies\frac{d}{ds}E[e^{sX}]\Big|_{s=0}=\left(E[X]+sE[X^2]+\frac{3s^2}{3!}E[X^3]+\cdots\right)\Big|_{s=0}=E[X]$$
$$\text{and}\quad\frac{d^2}{ds^2}E[e^{sX}]\Big|_{s=0}=\left(E[X^2]+sE[X^3]+\cdots\right)\Big|_{s=0}=E[X^2]$$
Convergence of MGFs ⟹ Convergence of CDFs:
That is, let X1,X2,… be a sequence of r.v.'s, each with MGF MXi(s) and CDF FXi(x)
Suppose i→∞limMXi(s)=MX(s), ∀s in a neighborhood of 0, and MX(s) is an MGF
Then there exists a r.v. X with CDF FX(x) where i→∞limFXi(x)=FX(x)
and this r.v. X has moments determined by MGF MX(s)
Example: Convergence of Binomial and Poisson MGFs
Bin(n,p) can be approximated by Pois(λ) using $\lambda=np$:
Let $X\sim\text{Bin}(n,p)$, $Y\sim\text{Pois}(np)$, so $M_X(s)=(1-p+pe^s)^n$ and $M_Y(s)=e^{np(e^s-1)}$:
$$M_X(s)=(1-p+pe^s)^n=(1+p(e^s-1))^n=\left(1+\frac{np(e^s-1)}{n}\right)^n\to e^{np(e^s-1)}=M_Y(s),\quad\text{as }n\to\infty$$
4.4.2 Inversion of Transforms
The transform MX(s) associated with a r.v. X uniquely determines the CDF of X, assuming that MX(s) is finite for all s in some interval [−a,a], where a is a positive number
The transform associated with a mixture of distributions:
Let X1,…,Xn be continuous r.v.'s with PDFs fX1,…,fXn. The value y of a r.v. Y is generated as follows: an index i is chosen with a corresponding probability pi, and y is taken to be equal to the value of Xi. Then, fY(y)=p1fX1(y)+⋯+pnfXn(y) and MY(s)=p1MX1(s)+⋯+pnMXn(s)
4.4.3 Sum of Independent Random Variables - MGF (Another way: 4.1.5 Convolution)
Let X,Y be independent r.v.'s, and let Z=X+Y: MZ(s)=E[es(X+Y)]=E[esXesY]=E[esX]E[esY]=MX(s)MY(s)
If X1,…,Xn is a collection of independent r.v.'s, and Z=X1+⋯+Xn, then MZ(s)=MX1(s)⋯MXn(s)
4.4.4 Transforms Associated with Joint Distributions
Consider n r.v.'s $X_1,\dots,X_n$ related to the same experiment. Let $s_1,\dots,s_n$ be scalar free parameters. The associated multivariate transform is a function of these n parameters: $M_{X_1,\dots,X_n}(s_1,\dots,s_n)=E\left[e^{s_1X_1+\dots+s_nX_n}\right]$
4.5 Sum of a Random Number of Independent Random Variables
Let $X_1,X_2,\dots$ be identically distributed r.v.'s with mean $E[X]$ and variance $\operatorname{var}(X)$. Let N be a r.v. that takes nonnegative integer values. Assume all of these r.v.'s are independent, and let $Y=X_1+\dots+X_N$. Then:
$$E[Y]=E[N]E[X],\qquad\operatorname{var}(Y)=E[N]\operatorname{var}(X)+(E[X])^2\operatorname{var}(N)$$
and $M_Y(s)$ is obtained from $M_N(s)$ by replacing each occurrence of $e^s$ with $M_X(s)$
$Y\sim\text{Gamma}(a,\lambda)$, that is, $Y=\frac{X}{\lambda}$ where $X\sim\text{Gamma}(a,1)$:
$$f_Y(y)=f_X(x)\frac{dx}{dy}=\lambda f_X(\lambda y)=\frac{1}{\Gamma(a)}(\lambda y)^a e^{-\lambda y}\frac{1}{y}$$
Moments of $X\sim\text{Gamma}(a,1)$:
$$E[X^k]=\int_0^\infty x^k\frac{1}{\Gamma(a)}x^a e^{-x}\frac{dx}{x}=\frac{1}{\Gamma(a)}\int_0^\infty x^{a+k}e^{-x}\frac{dx}{x}=\frac{\Gamma(a+k)}{\Gamma(a)},\quad\text{if }a+k>0$$
$$E[X]=\frac{\Gamma(a+1)}{\Gamma(a)}=a,\qquad E[X^2]=\frac{\Gamma(a+2)}{\Gamma(a)}=(a+1)a,\qquad\operatorname{var}(X)=E[X^2]-(E[X])^2=a$$
Moments of $Y\sim\text{Gamma}(a,\lambda)$, $Y=X/\lambda$: $E[Y]=\frac{a}{\lambda}$, $\operatorname{var}(Y)=\frac{a}{\lambda^2}$
Gamma-Exponential Connection: in Poisson process, interarrival times are i.i.d. Exp(λ), kth arrival time is Gamma(k,λ), which is sum of exponentials
Proof that $T_n\sim\text{Gamma}(n,1)$, where $T_n=\sum_{j=1}^{n}X_j$, $X_j\sim\text{i.i.d. }\text{Exp}(1)$:
$$M_{T_n}(s)=E[e^{sT_n}]=E\left[e^{s\sum_{j=1}^{n}X_j}\right]=\left(E[e^{sX_1}]\right)^n=\left(\frac{1}{1-s}\right)^n,\quad s<1$$
Let $Y\sim\text{Gamma}(n,1)$:
$$M_Y(s)=E[e^{sY}]=\int_0^\infty e^{sy}\frac{1}{\Gamma(n)}y^n e^{-y}\frac{dy}{y}=\frac{1}{\Gamma(n)}\int_0^\infty y^n e^{-(1-s)y}\frac{dy}{y}$$
Let $x=(1-s)y$; then
$$M_Y(s)=\frac{1}{\Gamma(n)}\int_0^\infty\left(\frac{x}{1-s}\right)^n e^{-x}\frac{dx}{x}=\frac{1}{(1-s)^n}\cdot\frac{1}{\Gamma(n)}\int_0^\infty x^n e^{-x}\frac{dx}{x}=\left(\frac{1}{1-s}\right)^n$$
$$\therefore M_Y(s)=M_{T_n}(s),\quad T_n\sim\text{Gamma}(n,1)$$
Gamma-Beta Connection: bank - post office example
Waiting time at the bank $X\sim\text{Gamma}(a,\lambda)$, waiting time at the post office $Y\sim\text{Gamma}(b,\lambda)$, with X, Y independent. Find the joint distribution of the total waiting time $T=X+Y$ and the fraction of time spent waiting at the bank $W=\frac{X}{X+Y}$. Let $\lambda=1$ to simplify notation.
$$f_{T,W}(t,w)=f_{X,Y}(x,y)\left|\frac{\partial(x,y)}{\partial(t,w)}\right|=\frac{1}{\Gamma(a)\Gamma(b)}x^a e^{-x}y^b e^{-y}\frac{1}{xy}\left|\frac{\partial(x,y)}{\partial(t,w)}\right|$$
From $x+y=t$ and $\frac{x}{x+y}=w$ we get $x=tw$, $y=t(1-w)$, and
$$\left|\frac{\partial(x,y)}{\partial(t,w)}\right|=\begin{vmatrix}\frac{\partial x}{\partial t}&\frac{\partial x}{\partial w}\\\frac{\partial y}{\partial t}&\frac{\partial y}{\partial w}\end{vmatrix}=\begin{vmatrix}w&t\\1-w&-t\end{vmatrix}=|-tw-t(1-w)|=t$$
$$f_{T,W}(t,w)=\frac{1}{\Gamma(a)\Gamma(b)}(tw)^a e^{-tw}(t(1-w))^b e^{-t(1-w)}\frac{1}{t^2w(1-w)}t=\frac{1}{\Gamma(a)\Gamma(b)}w^{a-1}(1-w)^{b-1}\cdot t^{a+b}e^{-t}\frac{1}{t}$$
$$=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}w^{a-1}(1-w)^{b-1}\cdot\frac{1}{\Gamma(a+b)}t^{a+b}e^{-t}\frac{1}{t}$$
$$\therefore f_T(t)=\frac{1}{\Gamma(a+b)}t^{a+b}e^{-t}\frac{1}{t},\qquad f_W(w)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}w^{a-1}(1-w)^{b-1}$$
$$T\sim\text{Gamma}(a+b,1),\quad W\sim\text{Beta}(a,b),\quad\text{and }T,W\text{ are independent}$$
Expectation of the Beta distribution: since T and W are independent,
$$E[X]=E[TW]=E[T]E[W]\implies E[W]=\frac{E[X]}{E[T]}=\frac{a}{a+b}$$
Order Statistics:
Let $X_1,\dots,X_n$ be i.i.d. with PDF f and CDF F. The order statistics are $X_{(1)}\le X_{(2)}\le\dots\le X_{(n)}$, where $X_{(1)}=\min(X_1,\dots,X_n)$, $X_{(2)}$ is the second smallest, …, $X_{(n)}=\max(X_1,\dots,X_n)$; clearly the $X_{(j)}$ are dependent. Find the CDF and PDF of $X_{(j)}$:
$$F_{X_{(j)}}(x)=P(X_{(j)}\le x)=P(\text{at least }j\text{ of the }X_i\text{'s}\le x)=\sum_{k=j}^{n}\binom{n}{k}F(x)^k(1-F(x))^{n-k}$$
For the PDF of $X_{(j)}$, consider one of the $X_i$'s falling in a tiny interval around x, with $j-1$ of them to the left and $n-j$ of them to the right:
$$f_{X_{(j)}}(x)\,dx=nf(x)\,dx\cdot\binom{n-1}{j-1}F(x)^{j-1}(1-F(x))^{n-j}$$
$$f_{X_{(j)}}(x)=n\binom{n-1}{j-1}F(x)^{j-1}(1-F(x))^{n-j}f(x)$$
For example, if $U_1,\dots,U_n$ are i.i.d. Unif(0,1), then $f_{U_{(j)}}(x)=n\binom{n-1}{j-1}x^{j-1}(1-x)^{n-j}$, i.e. $U_{(j)}\sim\text{Beta}(j,\ n-j+1)$
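A Monte Carlo check of $U_{(j)}\sim\text{Beta}(j,n-j+1)$ via its mean $j/(n+1)$ (n, j, and the trial count are illustrative choices):

```python
import random

# Sample the j-th smallest of n i.i.d. Unif(0,1) draws; its empirical mean
# should be close to the Beta(j, n-j+1) mean, j/(n+1).
n, j, trials = 5, 2, 100_000
rng = random.Random(4)
total = 0.0
for _ in range(trials):
    u = sorted(rng.random() for _ in range(n))
    total += u[j - 1]  # the j-th smallest value
est = total / trials
print(est, j / (n + 1))  # both near 1/3
```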
Chapter 5 Limit Theorems
5.1 Markov and Chebyshev Inequalities
Markov Inequality (based on the mean): use a bit of information about a distribution to learn something about probabilities of "extreme events"
"If $X\ge 0$ and $E[X]$ is small, then X is unlikely to be very large." If $X\ge 0$ and $a>0$, then $P(X\ge a)\le\frac{E[X]}{a}$
Chebyshev Inequality (based on the mean and variance): "If the variance is small, then X is unlikely to be too far from the mean." $P(|X-\mu|\ge c)\le\frac{\sigma^2}{c^2}$
Proof
Use the Markov Inequality: $P((X-\mu)^2\ge c^2)\le\frac{E[(X-\mu)^2]}{c^2}=\frac{\sigma^2}{c^2}$
$$P(|X-\mu|\ge k\sigma)\le\frac{\sigma^2}{k^2\sigma^2}=\frac{1}{k^2}$$
Upper bounds in the Chebyshev Inequality:
When X is known to take values in a range [a,b], we claim that $\sigma^2\le(b-a)^2/4$. Thus, if σ is unknown, we may use the bound $(b-a)^2/4$ in place of $\sigma^2$ in the Chebyshev Inequality, and obtain $P(|X-\mu|\ge c)\le\frac{(b-a)^2}{4c^2}$
Proof
To verify, note that for any constant γ we have
$$E[(X-\gamma)^2]=E[X^2]-2\gamma E[X]+\gamma^2=(\gamma-E[X])^2+E[X^2]-(E[X])^2\ge E[X^2]-(E[X])^2=\sigma^2$$
Let $\gamma=(a+b)/2$; then
$$\sigma^2\le E\left[\left(X-\frac{a+b}{2}\right)^2\right]=E[(X-a)(X-b)]+\frac{(b-a)^2}{4}\le\frac{(b-a)^2}{4}$$
since $(x-a)(x-b)\le 0$ for all $x\in[a,b]$. The bound $\sigma^2\le(b-a)^2/4$ may be quite conservative, but in the absence of further information about X it cannot be improved: it is satisfied with equality when X is the r.v. that takes the two extreme values a and b with equal probability 1/2.
Related Topics:
Better bounds/approximations on tail probabilities:
(a) For some real number a and $s\ge 0$:
$$P(X\ge a)=P(e^{sX}\ge e^{sa})\overset{\text{Markov}}{\le}\frac{E[e^{sX}]}{e^{sa}}=e^{-sa}M_X(s)=e^{-(sa-\ln M_X(s))}$$
$$\implies P(X\ge a)\le\min_{s\ge 0}e^{-(sa-\ln M_X(s))}=e^{-\max_{s\ge 0}(sa-\ln M_X(s))}=e^{-\phi(a)}$$
(b) Proof that if $a>E[X]$, then $\phi(a)>0$:
$$M_X(s)=E[e^{sX}]=1+sE[X]+\frac{s^2}{2!}E[X^2]+\cdots,\quad\text{near }s=0$$
When $s=0$: $sa-\ln M_X(s)=0-\ln 1=0$, and
$$\frac{d}{ds}(sa-\ln M_X(s))\Big|_{s=0}=a-\frac{1}{M_X(s)}\frac{dM_X(s)}{ds}\Big|_{s=0}=a-E[X]>0$$
Since $sa-\ln M_X(s)$ is zero and has a positive derivative at $s=0$, it must be positive when s is positive and small.
(c) Example: obtain a bound on $P(X\ge a)$ for the case where X is a standard normal r.v. and $a>0$. $X\sim N(0,1)$, $M_X(s)=e^{s^2/2}$:
$$\phi(a)=\max_{s\ge 0}\left(sa-\ln M_X(s)\right)=\max_{s\ge 0}\left(sa-\frac{s^2}{2}\right)=\max_{s\ge 0}\left(-\frac{1}{2}(s-a)^2+\frac{a^2}{2}\right)=\frac{a^2}{2}\qquad\therefore P(X\ge a)\le e^{-a^2/2}$$
(d) Let $X_1,X_2,\dots$ be independent r.v.'s with the same distribution as X. For any $a>E[X]$,
$$P\left(\frac{1}{n}\sum_{i=1}^{n}X_i\ge a\right)\le e^{-n\phi(a)}$$
so the probability that the sample mean exceeds the mean by a certain amount decreases exponentially with n
Hoeffding's Inequality: if the i.i.d. $X_i$ are equally likely to be −1 or 1, and $a>0$, then $P(X_1+\dots+X_n\ge na)\le e^{-na^2/2}$
Probabilities and Frequencies: consider an event A defined in the context of some probabilistic experiment. Let $p=P(A)$ be the probability of this event. We consider n independent repetitions of the experiment, and let $M_n$ be the fraction of time that event A occurs; in this context, $M_n$ is often called the empirical frequency of A:
$$M_n=\frac{X_1+\dots+X_n}{n},\qquad X_i=\begin{cases}1,&\text{if }A\text{ occurs}\\0,&\text{otherwise}\end{cases}$$
Since $E[X_i]=p$, the WLLN applies and shows that when n is large, the empirical frequency is most likely to be within ε of p. This allows us to conclude that empirical frequencies are faithful estimates of p.
5.3 Convergence in Probability
Convergence of a Deterministic Sequence:
Let $a_1,a_2,\dots$ be a sequence of real numbers, and let a be another real number. We say that the sequence $a_n$ converges to a, or $\lim_{n\to\infty}a_n=a$, if for every $\epsilon>0$ there exists some $n_0$ such that $|a_n-a|\le\epsilon$ for all $n\ge n_0$
Convergence in Probability:
Let $Y_1,Y_2,\dots$ be a sequence of r.v.'s (not necessarily independent), and let a be a real number. We say that the sequence $Y_n$ converges to a in probability if, for every $\epsilon>0$, we have $\lim_{n\to\infty}P(|Y_n-a|\ge\epsilon)=0$
"Almost all of the PMF/PDF of Yn eventually gets concentrated arbitrarily close to a"
Convergence in probability does not imply convergence of expectations: convergence in probability only cares about that the tail of the distribution has small probability; the expectation is highly sensitive to the tail of the distribution
Example
Convergence in probability of the sum of two r.v.'s: if $X_n\to a$ and $Y_n\to b$ in probability, then $X_n+Y_n\to a+b$
Related Topics:
Different types of convergence
Convergence "with probability 1": when an outcome of the experiment is obtained, the resulting sequence of values of the r.v.'s $Y_n$ converges to the value of the r.v. Y:
$$P\left(\left\{\omega: Y_n(\omega)\xrightarrow[n\to\infty]{}Y(\omega)\right\}\right)=1$$
Strong law of large numbers (5.5)
Convergence of a sequence of distributions (CDFs) to a limiting CDF
5.4 Central Limit Theorem
Different scaling of the sum of i.i.d. r.v.'s:
$$S_n=X_1+\dots+X_n,\qquad E[S_n]=n\mu,\quad\operatorname{var}(S_n)=n\sigma^2$$
$$M_n=\frac{S_n}{n}=\frac{X_1+\dots+X_n}{n},\qquad E[M_n]=\mu,\quad\operatorname{var}(M_n)=\frac{\sigma^2}{n}$$
$$Z_n=\frac{S_n-n\mu}{\sqrt{n}\,\sigma},\qquad E[Z_n]=0,\quad\operatorname{var}(Z_n)=1$$
Central Limit Theorem:
$$\lim_{n\to\infty}P(Z_n\le z)=\Phi(z)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z}e^{-t^2/2}\,dt,\quad\text{for all }z$$
Proof using convergence of MGF
Let $X_1,\dots,X_n$ be a sequence of i.i.d. r.v.'s with mean 0, variance $\sigma^2$, and associated transform (MGF) $M_X(s)$. Assume that $M_X(s)$ is finite when $-d<s<d$, where d is some positive number. Let $Z_n=\frac{X_1+\dots+X_n}{\sigma\sqrt{n}}$
(a) The transform associated with $Z_n$ satisfies $M_{Z_n}(s)=\left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n$. Derivation:
$$M_{Z_n}(s)=E[e^{sZ_n}]=E\left[\exp\left\{\frac{s}{\sigma\sqrt{n}}\sum_{i=1}^{n}X_i\right\}\right]=\prod_{i=1}^{n}E\left[\exp\left\{\frac{s}{\sigma\sqrt{n}}X_i\right\}\right]=\left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n$$
(b) Suppose that the transform $M_X(s)$ has a second-order Taylor series expansion around $s=0$, of the form $M_X(s)=a+bs+cs^2+o(s^2)$, where $o(s^2)$ is a function satisfying $\lim_{s\to 0}o(s^2)/s^2=0$. Find a, b, c.
Derivation: $a=M_X(0)=1$, $\ b=\frac{dM_X(s)}{ds}\Big|_{s=0}=E[X]=0$, $\ c=\frac{1}{2!}\frac{d^2M_X(s)}{ds^2}\Big|_{s=0}=\frac{E[X^2]}{2}=\frac{\sigma^2}{2}$
(c) Combine the results of (a) and (b) to show that the transform $M_{Z_n}(s)$ converges to the transform associated with a standard normal random variable, that is, $\lim_{n\to\infty}M_{Z_n}(s)=e^{s^2/2}$ for all s. Derivation:
$$M_{Z_n}(s)=\left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n=\left(a+\frac{bs}{\sigma\sqrt{n}}+\frac{cs^2}{\sigma^2 n}+o\left(\frac{s^2}{\sigma^2 n}\right)\right)^n=\left(1+\frac{s^2}{2n}+o\left(\frac{s^2}{\sigma^2 n}\right)\right)^n$$
$$\lim_{n\to\infty}M_{Z_n}(s)=\lim_{n\to\infty}\left(1+\frac{s^2/2}{n}\right)^n=e^{s^2/2}$$
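A numeric illustration of the CLT with i.i.d. Unif(0,1) summands (n, the target z, and the trial count are illustrative choices):

```python
import random
from math import erf, sqrt

# Standardize sums of n i.i.d. Unif(0,1) draws and compare the empirical
# value of P(Z_n <= z) with the standard normal CDF Phi(z).
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, trials = 30, 100_000
mu, sigma = 0.5, sqrt(1 / 12)  # mean and std of Unif(0,1)
rng = random.Random(5)
z_target = 1.0
count = 0
for _ in range(trials):
    s = sum(rng.random() for _ in range(n))
    zn = (s - n * mu) / (sqrt(n) * sigma)
    count += (zn <= z_target)
est = count / trials
print(est, phi(z_target))  # both near 0.841
```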
5.4.1 Approximations Based on the CLT
Let Sn=X1+⋯+Xn, where the Xi are i.i.d. r.v.'s with mean μ and variance σ2. If n is large, the probability P(Sn≤c) can be approximated by treating Sn as if it were normal, according to the following procedure.
Calculate the mean nμ and the variance nσ2 of Sn
Calculate the normalized value $z=\frac{c-n\mu}{\sqrt{n}\,\sigma}$
Use the approximation P(Sn≤c)≈Φ(z) where Φ(z) is available from standard normal CDF tables
The normal approximation is increasingly accurate as n→∞
How large n should be depends on whether the distribution of Xi is close to normal and whether it is symmetric
The normal approximation to P(Sn≤c) tends to be more faithful when c is in the vicinity of the mean of Sn
5.4.2 De Moivre-Laplace Approximation to the Binomial
If $S_n$ is a binomial r.v. with parameters n and p, n is large, and k, l are nonnegative integers, then
$$P(k\le S_n\le l)\approx\Phi\left(\frac{l+\frac{1}{2}-np}{\sqrt{np(1-p)}}\right)-\Phi\left(\frac{k-\frac{1}{2}-np}{\sqrt{np(1-p)}}\right)$$
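A check of the approximation against the exact binomial sum (n, p, k, l are illustrative choices):

```python
from math import comb, erf, sqrt

# Exact P(k <= S_n <= l) for S_n ~ Bin(n, p) versus the De Moivre-Laplace
# normal approximation with the 1/2 continuity correction.
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p, k, l = 100, 0.5, 45, 55
exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, l + 1))
s = sqrt(n * p * (1 - p))
approx = phi((l + 0.5 - n * p) / s) - phi((k - 0.5 - n * p) / s)
print(exact, approx)  # the two values should be very close
```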
5.5 The Strong Law of Large Numbers
Let $X_1,X_2,\dots$ be a sequence of i.i.d. r.v.'s with mean μ. Then the sequence of sample means $M_n=(X_1+\dots+X_n)/n$ converges to μ with probability 1:
$$P\left(\lim_{n\to\infty}\frac{X_1+\dots+X_n}{n}=\mu\right)=1$$
Convergence with Probability 1:
Let $Y_1,Y_2,\dots$ be a sequence of r.v.'s (not necessarily independent). Let c be a real number. $Y_n$ converges to c with probability 1 (or almost surely) if $P\left(\lim_{n\to\infty}Y_n=c\right)=1$
Convergence with probability 1 implies convergence in probability, but the converse is not necessarily true
Chapter 6 The Bernoulli and Poisson Processes
6.1 The Bernoulli Process
A sequence of independent Bernoulli trials $X_i$; at each trial i:
$$P(X_i=1)=P(\text{success at the }i\text{th trial})=p,\qquad P(X_i=0)=P(\text{failure at the }i\text{th trial})=1-p$$
Stochastic processes:
First view: infinite sequence of r.v.'s X1,X2,…
Interested in: $E[X_i]=p$, $\operatorname{var}(X_i)=p(1-p)$, $p_{X_i}(x)=\begin{cases}p,&x=1\\1-p,&x=0\end{cases}$
$$p_{X_1,\dots,X_n}(x_1,\dots,x_n)\overset{\text{indep.}}{=}p_{X_1}(x_1)\cdots p_{X_n}(x_n),\quad\text{for all }n$$
Second view: sample space Ω = set of infinite sequences of 0's and 1's (a phenomenon that evolves over time)
Number of successes/arrivals S in n time slots:
S=X1+⋯+Xn
$p_S(k)=\binom{n}{k}p^k(1-p)^{n-k},\quad k=0,\dots,n$
E[S]=np
var(S)=np(1−p)
Time until the first success/arrival:
T=min{i:Xi=1}
$p_T(k)=(1-p)^{k-1}p$
$E[T]=\frac{1}{p}$
$\operatorname{var}(T)=\frac{1-p}{p^2}$
6.1.1 Independence, Memorylessness and Fresh-Start Properties
For any given time n, the sequence of r.v.'s Xn+1,Xn+2,… (the future of the process) is also a Bernoulli process, and is independent from X1,…,Xn (the past of the process)
Let n be a given time and let $\bar{T}$ be the time of the first success after time n. Then $\bar{T}-n$ has a geometric distribution with parameter p, and is independent of the r.v.'s $X_1,\dots,X_n$:
$$P(\bar{T}-n=t\mid\bar{T}>n)=(1-p)^{t-1}p,\quad t=1,2,\dots$$
6.1.2 Interarrival Times
$Y_k$: the time of the kth success/arrival; $T_k$: the kth interarrival time ($T_i\sim\text{i.i.d. }\text{Geo}(p)$)
$$T_1=Y_1,\qquad T_k=Y_k-Y_{k-1}\quad\text{for }k=2,3,\dots$$
6.1.3 Time of kth Success/Arrival
Yk=T1+⋯+Tk
$E[Y_k]=E[T_1]+\dots+E[T_k]=\frac{k}{p}$
$\operatorname{var}(Y_k)=\operatorname{var}(T_1)+\dots+\operatorname{var}(T_k)=\frac{k(1-p)}{p^2}$
$p_{Y_k}(t)=\binom{t-1}{k-1}p^k(1-p)^{t-k},\quad t=k,k+1,\dots$
6.1.4 Splitting and Merging of Bernoulli Processes
6.2 The Poisson Process
Definition: $P(k,\tau)=P(\text{there are exactly }k\text{ arrivals during an interval of length }\tau)$. An arrival process is called a Poisson process with arrival rate/intensity λ (probability per unit time) if it has the following properties:
Time-homogeneity: the probability $P(k,\tau)$ of k arrivals is the same for all intervals of the same length τ
Independence: the number of arrivals during a particular interval is independent of the history of arrivals outside this interval
Small interval probabilities: for every small interval of length δ, the probabilities $P(k,\delta)$ satisfy
$$P(k,\delta)=\begin{cases}1-\lambda\delta+O(\delta^2),&k=0\\\lambda\delta+O(\delta^2),&k=1\\0+O(\delta^2),&k=2,3,\dots\end{cases}\qquad\text{where }\lim_{\delta\to 0}\frac{O(\delta^2)}{\delta}=0$$
6.2.1 Number of Arrivals in an Interval
Number of arrivals in an interval: $N_\tau$. $P(k,\tau)$ is approximately the same as the binomial probability of k successes in $n=\tau/\delta$ independent Bernoulli trials with success probability $p=\lambda\delta$ at each trial. Keeping the length τ of the interval fixed, let the period length $\delta\to 0$; then the number of periods $n\to\infty$ and the success probability $p\to 0$, while $np=\lambda\tau$ remains constant. Hence
$$p_{N_\tau}(k)=P(k,\tau)=e^{-\lambda\tau}\frac{(\lambda\tau)^k}{k!},\quad k=0,1,\dots\qquad E[N_\tau]=\lambda\tau,\quad\operatorname{var}(N_\tau)=\lambda\tau$$
Time until the first arrival: T
Assume that the process starts at time 0. We have $T>t$ if and only if there are no arrivals during the interval [0,t]:
$$F_T(t)=P(T\le t)=1-P(T>t)=1-P(0,t)=1-e^{-\lambda t}$$
$$f_T(t)=\frac{dF_T(t)}{dt}=\lambda e^{-\lambda t},\qquad E[T]=\frac{1}{\lambda},\quad\operatorname{var}(T)=\frac{1}{\lambda^2}$$
The sum of independent Poisson random variables is Poisson: the probability of k arrivals during a set of times of total length τ is always given by P(k,τ), even if that set is not an interval
6.2.2 Independence and Memorylessness
For any given time t>0, the process after time t is also a Poisson process, and is independent from history of the process until time t
Let t be a given time and let $\bar{T}$ be the time of the first arrival after time t. Then $\bar{T}-t$ has an exponential distribution with parameter λ, and is independent of the history of the process until time t:
$$P(\bar{T}-t>s)=P(0\text{ arrivals during }[t,t+s])=P(0,s)=e^{-\lambda s}$$
6.2.3 Interarrival Times
$Y_k$: the time of the kth success/arrival; $T_k$: the kth interarrival time ($T_i\sim\text{i.i.d. }\text{Exp}(\lambda)$)
$$T_1=Y_1,\qquad T_k=Y_k-Y_{k-1}\quad\text{for }k=2,3,\dots$$
Merging: the merged process is a Poisson process with rate $\lambda_1+\lambda_2$
Any particular arrival of the merged process has probability λ1/(λ1+λ2) of originating from the first process and probability λ2/(λ1+λ2) of originating from the second process, independent of all other arrivals and their origins
Competing Exponentials: three light bulbs have independent and exponentially distributed lifetimes Ta, Tb and Tc, with parameters λa, λb and λc.
(a) Find the distribution of the time until the first burnout, $Z=\min\{T_a,T_b,T_c\}$:
$$F_Z(z)=P(\min\{T_a,T_b,T_c\}\le z)=1-P(\min\{T_a,T_b,T_c\}>z)=1-P(T_a>z)P(T_b>z)P(T_c>z)=1-e^{-\lambda_a z}e^{-\lambda_b z}e^{-\lambda_c z}=1-e^{-(\lambda_a+\lambda_b+\lambda_c)z}$$
$$\therefore Z\sim\text{Exp}(\lambda_a+\lambda_b+\lambda_c),\qquad E[Z]=\frac{1}{\lambda_a+\lambda_b+\lambda_c}$$
(b) If $\lambda_a=\lambda_b=\lambda_c=\lambda$, find the expected time until all burn out. Let $T_1$ = time until the 1st burnout, $T_2$ = time after the 1st burnout until the 2nd, $T_3$ = time after the 2nd burnout until the 3rd. By memorylessness, $T_1\sim\text{Exp}(3\lambda)$, $T_2\sim\text{Exp}(2\lambda)$, $T_3\sim\text{Exp}(\lambda)$, and they are independent:
$$E[T_1+T_2+T_3]=\frac{1}{3\lambda}+\frac{1}{2\lambda}+\frac{1}{\lambda},\qquad\operatorname{var}(T_1+T_2+T_3)=\frac{1}{9\lambda^2}+\frac{1}{4\lambda^2}+\frac{1}{\lambda^2}$$
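A simulation sketch of part (a) (the rates and trial count are illustrative choices):

```python
import random

# The minimum of independent exponentials is exponential with the summed
# rate, so its empirical mean should be close to 1/(la + lb + lc).
la, lb, lc = 1.0, 2.0, 3.0
rng = random.Random(6)
trials = 200_000
total = sum(min(rng.expovariate(la), rng.expovariate(lb), rng.expovariate(lc))
            for _ in range(trials))
mean = total / trials
print(mean, 1 / (la + lb + lc))  # both near 0.167
```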
6.2.6 Bernoulli and Poisson Processes, and Sums of Random Variables
Let $N,X_1,X_2,\dots$ be independent r.v.'s, where N takes nonnegative integer values. Let $Y=X_1+\dots+X_N$ for positive values of N, and let $Y=0$ when $N=0$
If Xi∼Bern(p) and N∼Bin(m,q), then Y∼Bin(m,pq)
If Xi∼Bern(p) and N∼Pois(λ), then Y∼Pois(λp)
If Xi∼Geo(p) and N∼Geo(q), then Y∼Geo(pq)
If Xi∼Exp(λ) and N∼Geo(q), then Y∼Exp(λq)
Consider a Poisson process with parameter λ, and an independent r.v. T, which is exponential with parameter ν. The number of Poisson arrivals $N_T$ during the time interval [0,T] is geometric shifted by 1 to the left
Random incidence: let $t^*$ be the time at which an observer arrives, U the time of the preceding arrival, and V the time of the next arrival; L is the length of the interarrival interval containing $t^*$:
$$L=(t^*-U)+(V-t^*),\qquad(t^*-U)\sim\text{Exp}(\lambda),\quad(V-t^*)\sim\text{Exp}(\lambda)$$
$$L\sim\text{Erlang}(2,\lambda),\qquad E[L]=2/\lambda$$
An observer who arrives at an arbitrary time is more likely to fall in a large rather than a small interarrival interval. As a consequence, the expected interval length seen by the observer is higher: 2/λ, compared with the 1/λ mean of the exponential PDF
6.2.8 Poisson vs. Normal Approximations of the Binomial
Binomial(n,p)
p fixed, n→∞: Normal
np fixed, n→∞, p→0: Poisson
Pois(n)
sum of i.i.d. Pois(1), n→∞: Normal
In practice (rule of thumb):
p=1/100,n=100: Poisson
p=1/3,n=100: Normal
p=1/100,n=10000: Both
Chapter 7 Markov Chains
7.1 Discrete-Time Markov Chains
Checkout counter example:
discrete time n=0,1,…
customer arrivals: Bern(p)
customer service times: Geo(q)
state $X_n$: number of customers in the queue at time n. For example, let $X_0=2$; then a possible trajectory is $X_1=2$, $X_2=2+1=3$, $X_3=3$, $X_4=3-1=2$, $X_5=2$, $X_6=2+1-1=2$, …
Assume no more than 10 customers can be in the queue, the transition probability graph is as follows:
Discrete-time Markov chains:
at each time step n, the state of the chain: Xn
state space (a finite set of possible states): $S=\{1,\dots,m\}$
transition probability matrix:
$$\begin{bmatrix}p_{11}&p_{12}&\cdots&p_{1m}\\p_{21}&p_{22}&\cdots&p_{2m}\\\vdots&\vdots&&\vdots\\p_{m1}&p_{m2}&\cdots&p_{mm}\end{bmatrix}$$
Markov Property: the Markov chain is a sequence of r.v.'s $X_0,X_1,X_2,\dots$ that take values in S and satisfy
$$P(X_{n+1}=j\mid X_n=i,X_{n-1}=i_{n-1},\dots,X_0=i_0)=P(X_{n+1}=j\mid X_n=i)=p_{ij}$$
for all times n, all states $i,j\in S$, and all possible sequences $i_0,\dots,i_{n-1}$ of earlier states
n-step transition probability: the probability that the state after n time periods will be j, given that the current state is i:
$$r_{ij}(n)=P(X_n=j\mid X_0=i)=P(X_{n+s}=j\mid X_s=i)$$
Chapman-Kolmogorov Equation: the n-step transition probabilities can be generated by the recursive formula
$$r_{ij}(n)=\sum_{k=1}^{m}r_{ik}(n-1)p_{kj},\quad\text{for }n>1\text{ and all }i,j$$
starting with $r_{ij}(1)=p_{ij}$
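The recursion can be run directly; a sketch on a 2-state chain (the matrix P is an illustrative choice):

```python
# Chapman-Kolmogorov in code: r(n) is built by repeatedly applying
# r_ij(n) = sum_k r_ik(n-1) * p_kj, starting from r(0) = identity.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def n_step(P, n):
    m = len(P)
    r = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for _ in range(n):
        r = [[sum(r[i][k] * P[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return r

r5 = n_step(P, 5)
print(r5)  # each row is a probability distribution (sums to 1)
```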
7.2 Classification of States
Accessible: a state j is accessible from a state i if for some n, rij(n) is positive
Let A(i) be the set of states that are accessible from i
state i is recurrent if for every j that is accessible from i, i is also accessible from j; that is, for all j that belong to A(i) we have that i belongs to A(j)
state i is transient if there is a state j∈A(i) such that i is not accessible from j
if i is a recurrent state, the set of states A(i) forms a recurrent class, meaning that the states in A(i) are all accessible from each other and no state outside A(i) is accessible from them
Markov Chain Decomposition:
A Markov chain can be decomposed into one or more recurrent classes, plus possibly some transient states
A recurrent state is accessible from all states in its class, but is not accessible from recurrent states in other classes
A transient state is not accessible from any recurrent state
At least one, possibly more, recurrent states are accessible from a given transient state
(a) once the state enters (or starts in) a recurrent class, it stays within that class; since all states in the class are accessible from each other, all states in the class will be visited an infinite number of times
(b) if the initial state is transient, then the state trajectory contains an initial portion consisting of transient states and a final portion consisting of recurrent states from the same class
Periodicity: consider a recurrent class R
The class is periodic if its states can be grouped in d>1 disjoint subsets S1,…,Sd, so that all transitions from Sk lead to Sk+1 (or to S1 if k=d)
The class is aperiodic if and only if there exists a time $n$ such that $r_{ij}(n) > 0$ for all $i, j \in R$
7.3 Steady-State Behavior
Situations that rij(n) will not converge to steady-state values that are independent of the initial state:
Multiple recurrent classes: the limit of $r_{ij}(n)$ must depend on the initial state (the possibility of visiting $j$ in the future depends on whether $j$ is in the same class as the initial state $i$)
Simple periodic recurrent class: consider a recurrent class with two states, 1 and 2, such that from 1 we can only go to 2, and from 2 we can only go to 1 ($p_{12} = p_{21} = 1$). Then $r_{ij}(n)$ generically oscillates: $r_{ii}(n) = \begin{cases} 1, & n \text{ even} \\ 0, & n \text{ odd} \end{cases}$
Steady-state probability: approximately independent of the initial state $X_0 = i$: $\pi_j \approx P(X_n = j)$, when $n$ is large
Steady-State Convergence Theorem: consider a Markov chain with a single recurrent class which is aperiodic (plus possibly some transient states). Then, the states $j$ are associated with steady-state probabilities $\pi_j$ that have the following properties
(a) For each $j$, we have $\lim_{n\to\infty} r_{ij}(n) = \pi_j$, for all $i$
(b) The $\pi_j$ are the unique solution to the system of equations: $\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}$, $j = 1, \dots, m$ (balance equations), and $1 = \sum_{k=1}^{m} \pi_k$ (normalization equation)
Derivation of the balance equations: start from $r_{ij}(n) = \sum_{k=1}^{m} r_{ik}(n-1)\, p_{kj}$; taking the limit $n \to \infty$ on both sides gives $\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}$
(c) We have $\pi_j = 0$ for all transient states $j$, and $\pi_j > 0$ for all recurrent states $j$
The $\pi_j$ form a stationary distribution of the chain: if the initial state is chosen according to this distribution, $P(X_0 = j) = \pi_j$, $j = 1, \dots, m$, then $P(X_1 = j) = \sum_{k=1}^{m} P(X_0 = k)\, p_{kj} = \sum_{k=1}^{m} \pi_k p_{kj} = \pi_j$; similarly, $P(X_n = j) = \pi_j$, for all $n$ and $j$
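A minimal numerical sketch of the balance equations, solved by iterating $\pi \leftarrow \pi P$ (valid for an aperiodic single recurrent class; the matrix below is a hypothetical example):

```python
# Iterate pi <- pi P until convergence; the limit solves the balance
# equations together with the normalization equation.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = [1.0, 0.0]                        # any initial distribution works
for _ in range(200):
    pi = [sum(pi[k] * P[k][j] for k in range(2)) for j in range(2)]

# closed form for a two-state chain: pi_1 = p21/(p12+p21), pi_2 = p12/(p12+p21)
assert abs(pi[0] - 0.5 / 0.6) < 1e-9
assert abs(pi[1] - 0.1 / 0.6) < 1e-9
```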
7.3.1 Long-Term Frequency Interpretations
Balance equations interpreted by frequency: πj=k=1∑mπkpkj
(Long-run) frequency of being in $j$: $\pi_j = \lim_{n\to\infty} \dfrac{\nu_{ij}(n)}{n}$, where $\nu_{ij}(n)$ is the expected value of the number of visits to state $j$ within the first $n$ transitions, starting from state $i$
frequency of transitions $k \to j$: $\pi_k p_{kj} = \lim_{n\to\infty} \dfrac{q_{kj}(n)}{n}$, where $q_{kj}(n)$ is the expected number of transitions that take the state from $k$ to $j$ within the first $n$ transitions
frequency of transitions into $j$: $\sum_k \pi_k p_{kj}$
7.3.2 Birth-Death Processes
Local balance equations: the expected frequency of transitions from $i$ to $i+1$ must be equal to the expected frequency of transitions from $i+1$ to $i$: $\pi_i b_i = \pi_{i+1} d_{i+1}$, which gives $\pi_i = \pi_0 \dfrac{b_0 b_1 \cdots b_{i-1}}{d_1 d_2 \cdots d_i}$, together with $\sum_i \pi_i = 1$
Special case: $b_i = p$ and $d_i = q$ for all $i$. Let $\rho = p/q$; then $\pi_{i+1} = \pi_i \rho$, so $\pi_i = \pi_0 \rho^i$, $i = 0, 1, \dots, m$, and $\sum_i \pi_i = \pi_0 (1 + \rho + \rho^2 + \cdots + \rho^m) = 1$
(1) assume $p = q$: symmetric random walk, $\pi_i = \pi_0 = \dfrac{1}{m+1}$, $i = 0, \dots, m$; equally likely to be at any state
(2) assume $p < q$ and $m \to \infty$: $\pi_0 = \dfrac{1}{\sum_{i=0}^{\infty} \rho^i} = 1 - \rho$, so $\pi_i = (1-\rho)\rho^i$, a shifted geometric distribution with $E[X_n] = \dfrac{\rho}{1-\rho}$
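The special case above can be checked numerically; the values of $p$, $q$, $m$ below are hypothetical:

```python
# Stationary distribution of the birth-death chain with b_i = p, d_i = q,
# using pi_i = pi_0 * rho**i with rho = p/q.
p, q, m = 0.3, 0.6, 10
rho = p / q
pi0 = 1.0 / sum(rho**i for i in range(m + 1))
pi = [pi0 * rho**i for i in range(m + 1)]

assert abs(sum(pi) - 1.0) < 1e-12
# local balance: pi_i * p == pi_{i+1} * q for every interior pair
assert all(abs(pi[i] * p - pi[i + 1] * q) < 1e-12 for i in range(m))
```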
7.4 Absorption Probabilities and Expected Time to Absorption
Absorption probability equations: $a_s = 1$; $a_i = 0$ for all absorbing $i \neq s$; $a_i = \sum_{j=1}^{m} p_{ij} a_j$ for all transient $i$
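These equations can be solved by fixed-point iteration; the 4-state chain below (a symmetric random walk with two absorbing barriers) is a hypothetical example:

```python
# Absorption probabilities a_i = P(reach s | start at i), with states 0 and 3
# absorbing and target s = 3, iterating a_i = sum_j p_ij a_j on the
# transient states.
P = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
a = [0.0, 0.0, 0.0, 1.0]               # a_s = 1; a_i = 0 for the other absorbing state
for _ in range(200):
    a = [a[0],
         sum(P[1][j] * a[j] for j in range(4)),
         sum(P[2][j] * a[j] for j in range(4)),
         a[3]]

# symmetric walk: a_1 = 1/3, a_2 = 2/3
assert abs(a[1] - 1/3) < 1e-9 and abs(a[2] - 2/3) < 1e-9
```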
7.4.1 Expected Time to Absorption
Expected time to absorption: $\mu_i = E[\text{number of transitions until absorption, starting from } i] = E[\min\{n \ge 0 : X_n \text{ is recurrent}\} \mid X_0 = i]$
Equations for the expected time to absorption: $\mu_i = 0$ for all recurrent states $i$; $\mu_i = 1 + \sum_{j=1}^{m} p_{ij} \mu_j$ for all transient states $i$
7.4.2 Mean First Passage and Recurrence Times
Mean first passage time from $i$ to $s$: $t_i = E[\text{number of transitions to reach } s \text{ for the first time, starting from } i] = E[\min\{n \ge 0 : X_n = s\} \mid X_0 = i]$
Equations for the mean first passage time: $t_s = 0$; $t_i = 1 + \sum_j p_{ij} t_j$ for all $i \neq s$
Note: what happens after reaching $s$ is not part of the problem, so we can think of $s$ as an absorbing state
Mean recurrence time of $s$: $t_s^* = E[\text{number of transitions up to the first return to } s \text{, starting from } s] = E[\min\{n \ge 1 : X_n = s\} \mid X_0 = s]$
Solution for the mean recurrence time: $t_s^* = 1 + \sum_j p_{sj} t_j$
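A small worked example of these equations, on a hypothetical two-state chain with target state $s = 1$:

```python
# Mean first passage times to s = 1: t_s = 0, t_i = 1 + sum_j p_ij t_j for
# i != s; then t_s* = 1 + sum_j p_sj t_j.
P = [[0.9, 0.1],
     [0.5, 0.5]]
t = [0.0, 0.0]                          # t[1] = t_s = 0 stays fixed
for _ in range(2000):
    t[0] = 1 + P[0][0] * t[0] + P[0][1] * t[1]
t_star = 1 + P[1][0] * t[0] + P[1][1] * t[1]

# t_0 solves t_0 = 1 + 0.9 t_0  =>  t_0 = 10; t_1* = 1 + 0.5 * 10 = 6
assert abs(t[0] - 10.0) < 1e-6
assert abs(t_star - 6.0) < 1e-6
# consistency with the steady state: t_s* = 1 / pi_s, and pi_1 = 0.1/0.6 here
assert abs(t_star - 0.6 / 0.1) < 1e-6
```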
7.5 Continuous-Time Markov Chains
Assumptions:
If the current state is i, the time until the next transition is exponentially distributed with a given parameter νi, independent of the past history of the process and of the next state
If the current state is i, the next state will be j with a given probability pij, independent of the past history of the process and of the time until the next transition
Transition rate out of state i: νi (the average number of transitions out of state i, per unit time spent at state i)
Transition rate from i to j: qij (the average number of transitions from i to j, per unit time spent at state i, since only a fraction pij of the transitions out of state i will lead to state j) qij=νipij
Obtaining the transition rates and probabilities: $\nu_i = \sum_{j=1}^{m} q_{ij}$, $p_{ij} = \dfrac{q_{ij}}{\nu_i}$
Self-transitions are ignored because of the memorylessness of the exponential distribution (the remaining time until the next transition is the same irrespective of whether a self-transition just occurred or not): $p_{ii} = q_{ii} = 0$ for all $i$
7.5.1 Approximation by a Discrete-Time Markov Chain
Let $\delta$ be a small positive time interval and consider the discrete-time Markov chain $Z_n$ obtained by observing $X(t)$ every $\delta$ time units: $Z_n = X(n\delta)$, $n = 0, 1, \dots$. Given that $Z_n = i$, there is a probability approximately equal to $\nu_i \delta$ that there is a transition between times $n\delta$ and $(n+1)\delta$, and in that case there is a further probability $p_{ij}$ that the next state is $j$: $\bar p_{ij} = P(Z_{n+1} = j \mid Z_n = i) = \nu_i p_{ij} \delta + o(\delta) = q_{ij}\delta + o(\delta)$, if $j \neq i$, and $\bar p_{ii} = P(Z_{n+1} = i \mid Z_n = i) = 1 - \sum_{j \neq i} \bar p_{ij} + o(\delta)$
Given the current state $i$ of a continuous-time Markov chain, and for any $j \neq i$, the state $\delta$ time units later is equal to $j$ with probability $q_{ij}\delta + o(\delta)$, independent of the past history of the process
7.5.2 Steady-State Behavior
Consider a continuous-time Markov chain with a single recurrent class. Then, the states $j$ are associated with steady-state probabilities $\pi_j$ that have the following properties:
(a) For each $j$, we have $\lim_{t\to\infty} P(X(t) = j \mid X(0) = i) = \pi_j$, for all $i$
(b) The $\pi_j$ are the unique solution to the system of equations: $\pi_j \sum_{k \neq j} q_{jk} = \sum_k \pi_k q_{kj}$ (balance equations), and $\sum_{k=1}^{m} \pi_k = 1$ (normalization equation)
Derivation of the balance equations: $\pi_j = \sum_{k=1}^{m} \pi_k \bar p_{kj} = \pi_j \bar p_{jj} + \sum_{k \neq j} \pi_k \bar p_{kj} = \pi_j \Big(1 - \delta \sum_{k \neq j} q_{jk} + o(\delta)\Big) + \sum_{k \neq j} \pi_k \big(q_{kj}\delta + o(\delta)\big)$; dividing by $\delta$ and letting $\delta \to 0$ gives $\pi_j \sum_{k \neq j} q_{jk} = \sum_k \pi_k q_{kj}$
7.5.3 Birth-Death Processes
Local balance equations: $\pi_j q_{ji} = \pi_i q_{ij}$, for all $i, j$
Chapter 8 Bayesian Statistical Inference
Unknown parameters are viewed as random variables with a single, fully-specified probabilistic model
8.1 Bayesian Inference and the Posterior Distribution
Bayesian Inference:
Start with a prior distribution pΘ or fΘ for the unknown r.v. Θ
Have a model pX∣Θ or fX∣Θ of the observation vector X
After observing the value x of X, we form the posterior distribution of Θ, using the appropriate version of Bayes' rule
The four versions of Bayes' rule:
$\Theta$ discrete, $X$ discrete: $p_{\Theta\mid X}(\theta \mid x) = \dfrac{p_\Theta(\theta)\, p_{X\mid\Theta}(x \mid \theta)}{\sum_{\theta'} p_\Theta(\theta')\, p_{X\mid\Theta}(x \mid \theta')}$
$\Theta$ discrete, $X$ continuous: $p_{\Theta\mid X}(\theta \mid x) = \dfrac{p_\Theta(\theta)\, f_{X\mid\Theta}(x \mid \theta)}{\sum_{\theta'} p_\Theta(\theta')\, f_{X\mid\Theta}(x \mid \theta')}$
$\Theta$ continuous, $X$ discrete: $f_{\Theta\mid X}(\theta \mid x) = \dfrac{f_\Theta(\theta)\, p_{X\mid\Theta}(x \mid \theta)}{\int f_\Theta(\theta')\, p_{X\mid\Theta}(x \mid \theta')\, d\theta'}$
$\Theta$ continuous, $X$ continuous: $f_{\Theta\mid X}(\theta \mid x) = \dfrac{f_\Theta(\theta)\, f_{X\mid\Theta}(x \mid \theta)}{\int f_\Theta(\theta')\, f_{X\mid\Theta}(x \mid \theta')\, d\theta'}$
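The discrete/discrete version can be sketched directly; the prior and likelihood numbers below mirror the disease-testing example (1% prevalence, a "95% accurate" test) and are otherwise hypothetical:

```python
# Posterior p(theta | x) is proportional to prior(theta) * likelihood(x | theta).
prior = {"H0": 0.99, "H1": 0.01}              # p_Theta: healthy vs diseased
likelihood = {"H0": 0.05, "H1": 0.95}         # p_{X|Theta}(positive | theta)

norm = sum(prior[t] * likelihood[t] for t in prior)
posterior = {t: prior[t] * likelihood[t] / norm for t in prior}

# matches P(D | T) = 0.95*0.01 / (0.95*0.01 + 0.05*0.99), roughly 0.16
assert abs(posterior["H1"] - 0.95*0.01 / (0.95*0.01 + 0.05*0.99)) < 1e-12
```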
8.2 Point Estimation, Hypothesis Testing, and the MAP Rule
Maximum a Posteriori Probability (MAP) rule: given the value $x$ of the observation, we select a value of $\theta$, denoted $\hat\theta$, that maximizes the posterior distribution: $\hat\theta = \arg\max_\theta p_{\Theta\mid X}(\theta \mid x)$ if $\Theta$ is discrete, $\hat\theta = \arg\max_\theta f_{\Theta\mid X}(\theta \mid x)$ if $\Theta$ is continuous
Since the denominator is the same for all $\theta$ and depends only on the value $x$ of the observation, to maximize the posterior we only need to choose a value of $\theta$ that maximizes the numerator: $p_\Theta(\theta)\, p_{X\mid\Theta}(x \mid \theta)$, $p_\Theta(\theta)\, f_{X\mid\Theta}(x \mid \theta)$, $f_\Theta(\theta)\, p_{X\mid\Theta}(x \mid \theta)$, or $f_\Theta(\theta)\, f_{X\mid\Theta}(x \mid \theta)$
If Θ takes only a finite number of values, the MAP rule minimizes the probability of selecting an incorrect hypothesis. This is true for both the unconditional probability of error and the conditional one, given any observation value x
8.2.1 Point Estimation
An estimator is a r.v. of the form Θ^=g(X), for some function g; different choices of g correspond to different estimators
An estimate is the value θ^ of an estimator, as determined by the realized value x of the observation X
Once the value x of X is observed, the MAP estimator sets the estimate θ^ to a value that maximizes the posterior distribution over all possible values of θ
If the posterior distribution of Θ is symmetric around its (conditional) mean and unimodal, the maximum occurs at the mean. Then, the MAP estimator coincides with the LMS estimator (i.e., normal)
If Θ is continuous, the actual evaluation of the MAP estimate θ^ can sometimes be carried out analytically; for example, if there are no constraints on θ, by setting the derivative of fΘ∣X(θ∣x), or of log fΘ∣X(θ∣x), to zero and solving for θ
Once the value x of X is observed, the Conditional Expectation (LMS) estimator sets the estimate θ^ to E[Θ∣X=x]
8.2.2 Hypothesis Testing
The MAP rule for hypothesis testing:
Given the observation value x, the MAP rule selects a hypothesis Hi for which the value of the posterior probability P(Θ=θi∣X=x) is largest
Equivalently, it selects a hypothesis Hi for which pΘ(θi)pX∣Θ(x∣θi) or pΘ(θi)fX∣Θ(x∣θi) is largest
The MAP rule minimizes the probability of selecting an incorrect hypothesis for any observation value x, as well as the probability of error over all decision rules
If gMAP(x) is the hypothesis selected by the MAP rule when X=x, the probability of correct decision is P(Θ=gMAP(x)∣X=x)
If $S_i$ is the set of all $x$ such that the MAP rule selects hypothesis $H_i$, the overall probability of correct decision is $P(\Theta = g_{\mathrm{MAP}}(X)) = \sum_i P(\Theta = \theta_i, X \in S_i)$, and the corresponding probability of error is $\sum_i P(\Theta \neq \theta_i, X \in S_i)$
8.3 Bayesian Least Mean Squares (LMS) Estimation
LMS: minimize Mean Squared Error (MSE)
In the absence of any observations, $E[(\Theta - \hat\theta)^2]$ is minimized when $\hat\theta = E[\Theta]$: $E[(\Theta - E[\Theta])^2] \le E[(\Theta - \hat\theta)^2]$, for all $\hat\theta$
For any given value $x$ of $X$, $E[(\Theta - \hat\theta)^2 \mid X = x]$ is minimized when $\hat\theta = E[\Theta \mid X = x]$: $E[(\Theta - E[\Theta \mid X = x])^2 \mid X = x] \le E[(\Theta - \hat\theta)^2 \mid X = x]$, for all $\hat\theta$
Out of all estimators $g(X)$ of $\Theta$ based on $X$, the mean squared estimation error $E[(\Theta - g(X))^2]$ is minimized when $g(X) = E[\Theta \mid X]$: $E[(\Theta - E[\Theta \mid X])^2] \le E[(\Theta - g(X))^2]$, for all estimators $g(X)$
LMS Performance Evaluation:
Expected performance once we have a measurement: $\mathrm{MSE} = E[(\Theta - E[\Theta \mid X = x])^2 \mid X = x] = \mathrm{var}(\Theta \mid X = x)$
Expected performance of the estimator: $\mathrm{MSE} = E[(\Theta - E[\Theta \mid X])^2] = E[\mathrm{var}(\Theta \mid X)]$
8.3.1 Some Properties of the Estimation Error
$\hat\Theta = E[\Theta \mid X]$, $\tilde\Theta = \hat\Theta - \Theta$
Properties of estimation error:
The estimation error $\tilde\Theta$ is unbiased: it has zero unconditional and conditional mean, $E[\tilde\Theta] = 0$ and $E[\tilde\Theta \mid X = x] = 0$ for all $x$
The estimation error $\tilde\Theta$ is uncorrelated with the estimate $\hat\Theta$: $\mathrm{cov}(\tilde\Theta, \hat\Theta) = 0$
The variance of $\Theta$ can be decomposed as $\mathrm{var}(\Theta) = \mathrm{var}(\hat\Theta) + \mathrm{var}(\tilde\Theta)$
8.3.2 The Case of Multiple Observations and Multiple Parameters
Multiple parameters: find $\hat\Theta_1, \dots, \hat\Theta_m$ to minimize $E[(\Theta_1 - \hat\Theta_1)^2] + \cdots + E[(\Theta_m - \hat\Theta_m)^2]$; this is essentially $m$ decoupled estimation problems, one for each unknown parameter $\Theta_i$, yielding $\hat\Theta_i = E[\Theta_i \mid X_1, \dots, X_n]$, for all $i$
8.4 Bayesian Linear Least Mean Squares (LLMS) Estimation
8.4.1 LLMS Estimation Based on a Single Observation
The LLMS estimator $\hat\Theta$ of $\Theta$ based on $X$ is $\hat\Theta = E[\Theta] + \dfrac{\mathrm{cov}(\Theta, X)}{\mathrm{var}(X)}(X - E[X]) = E[\Theta] + \rho \dfrac{\sigma_\Theta}{\sigma_X}(X - E[X])$
Derivation: let Θ^=aX+b that minimizes the MSE E[(Θ−aX−b)2]
First, choose a constant $b$ to estimate the r.v. $\Theta - aX$. The best choice is $b = E[\Theta - aX] = E[\Theta] - aE[X]$. With this choice of $b$, it remains to minimize, with respect to $a$, $E[(\Theta - aX - E[\Theta - aX])^2] = \mathrm{var}(\Theta - aX) = \mathrm{var}(\Theta) + a^2 \mathrm{var}(X) - 2\,\mathrm{cov}(\Theta, aX) = \sigma_\Theta^2 + a^2\sigma_X^2 - 2a\,\mathrm{cov}(\Theta, X)$. To minimize $\mathrm{var}(\Theta - aX)$, set its derivative to zero and solve for $a$: $\dfrac{d}{da}\mathrm{var}(\Theta - aX) = 2\sigma_X^2 a - 2\,\mathrm{cov}(\Theta, X) = 0$, so $a = \dfrac{\mathrm{cov}(\Theta, X)}{\sigma_X^2} = \rho\dfrac{\sigma_\Theta}{\sigma_X}$
The resulting MSE is equal to $(1 - \rho^2)\sigma_\Theta^2$
Derivation: $E[(\Theta - \hat\Theta)^2] = \mathrm{var}(\Theta - aX) = \sigma_\Theta^2 + a^2\sigma_X^2 - 2a\,\mathrm{cov}(\Theta, X) = \sigma_\Theta^2 + \rho^2\dfrac{\sigma_\Theta^2}{\sigma_X^2}\sigma_X^2 - 2\rho\dfrac{\sigma_\Theta}{\sigma_X}\,\rho\,\sigma_\Theta\sigma_X = (1 - \rho^2)\sigma_\Theta^2$
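A quick numerical check of these formulas, on a hypothetical linear model $\Theta = 2X + W$ with $X$ and $W$ independent (variances chosen arbitrarily); for such a model the LLMS error should equal the noise variance exactly:

```python
# Verify a = cov/var(X) and MSE = (1 - rho^2) * var(Theta) for Theta = 2X + W.
var_X, var_W = 1.5, 0.5
cov_TX = 2 * var_X                      # cov(Theta, X)
var_T = 4 * var_X + var_W               # var(Theta)
a = cov_TX / var_X                      # LLMS coefficient
rho2 = cov_TX**2 / (var_X * var_T)      # squared correlation coefficient
mse = (1 - rho2) * var_T                # resulting mean squared error

assert abs(a - 2.0) < 1e-12
# the LLMS residual Theta - aX = W, so the MSE is exactly var(W)
assert abs(mse - var_W) < 1e-12
```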
The estimator starts with the baseline estimate E[Θ] for Θ, which is then adjusted by taking into account the value of X−E[X]. For example, when X is larger than its mean, the positive correlation between X and Θ suggests that Θ is expected to be larger than its mean. Accordingly, the resulting estimate is set to a value larger than E[Θ].
When ∣ρ∣ is close to 1, the two r.v.'s are highly correlated, and knowing X allows us to accurately estimate Θ, resulting in a small MSE
The properties of the estimation error presented in LMS hold when Θ^ is LLMS estimator
8.4.2 The Case of Multiple Observations and Multiple Parameters
Multiple parameters: to minimize $E[(\Theta_1 - \hat\Theta_1)^2] + \cdots + E[(\Theta_m - \hat\Theta_m)^2]$, find a linear estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i - \hat\Theta_i)^2]$ for each $i$, so that we are essentially dealing with $m$ decoupled linear estimation problems, one for each unknown parameter
In the case of multiple observations with a certain independence property: let $\Theta$ be a r.v. with mean $\mu$ and variance $\sigma_0^2$, and let $X_1, \dots, X_n$ be observations of the form $X_i = \Theta + W_i$, where the $W_i$ are r.v.'s with mean 0 and variance $\sigma_i^2$ representing observation errors. Under the assumption that $\Theta, W_1, \dots, W_n$ are uncorrelated, the LLMS estimator of $\Theta$ based on $X_1, \dots, X_n$ is $\hat\Theta = \dfrac{\mu/\sigma_0^2 + \sum_{i=1}^{n} X_i/\sigma_i^2}{\sum_{i=0}^{n} 1/\sigma_i^2}$
8.4.3 Linear Estimation and Normal Models
The LLMS estimator is generally different from and inferior to the LMS estimator E[Θ∣X1,…,Xn]. However, if the LMS estimator happens to be linear in the observations X1,…,Xn, then it is also the LLMS estimator
Example: the estimation of a normal r.v. Θ on the basis of observations Xi=Θ+Wi, where Wi are independent zero mean normal noise terms, independent of Θ
More generally, if Θ,X1,…,Xn are all linear functions of a collection of independent normal random variables, then the LMS and the LLMS estimators coincide; they also coincide with the MAP estimator
8.4.4 The Choice of Variables in Linear Estimation
Estimation based on h(X) versus X:
LMS: for a one-to-one function h, E[Θ∣h(X)]=E[Θ∣X]
LLMS: different, estimator Θ^=ah(X)+b versus Θ^=aX+b
for example, suppose that Θ is the unknown variance of some distribution and X1,…,Xn represent independent random variables drawn from that distribution; then it would be unreasonable to expect that a good estimator of Θ can be obtained with a linear function of X1,…,Xn
simple calculation, minimize MSE based on linear function of observation, where MSE=(1−ρ2)σΘ2
Chapter 9 Classical Statistical Inference
Unknown parameters are viewed as constants to be determined; each possible value of the unknown parameter corresponds to a separate probabilistic model
9.1 Classical Parameter Estimation
9.1.1 Properties of Estimators
Let Θ^n be an estimator of an unknown parameter θ, that is, a function of n observations X1,…,Xn whose distribution depends on θ
Estimation error: Θ~n=Θ^n−θ
Bias of the estimator: $b_\theta(\hat\Theta_n) = E_\theta[\tilde\Theta_n] = E_\theta[\hat\Theta_n] - \theta$
The expected value, variance, and bias of Θ^n depend on θ, while the estimation error depends in addition on the observations X1,…,Xn
We call Θ^n unbiased if Eθ[Θ^n]=θ, for every possible value of θ
We call $\hat\Theta_n$ asymptotically unbiased if $\lim_{n\to\infty} E_\theta[\hat\Theta_n] = \theta$, for every possible value of $\theta$
We call Θ^n consistent if the sequence Θ^n converges to the true value of parameter θ, in probability, for every possible value of θ (WLLN)
Mean Squared Error: $E_\theta[\tilde\Theta_n^2] = b_\theta^2(\hat\Theta_n) + \mathrm{var}_\theta(\hat\Theta_n)$
9.1.2 Maximum Likelihood Estimation (ML)
Let the vector of observations $X = (X_1, \dots, X_n)$ be described by a joint PMF $p_X(x; \theta)$ whose form depends on an unknown (scalar or vector) parameter $\theta$. Suppose we observe a particular value $x = (x_1, \dots, x_n)$ of $X$. Then, an ML estimate is a value of the parameter that maximizes the numerical function $p_X(x_1, \dots, x_n; \theta)$ over all $\theta$: $\hat\theta_n = \arg\max_\theta p_X(x_1, \dots, x_n; \theta)$ ($X$ discrete), $\hat\theta_n = \arg\max_\theta f_X(x_1, \dots, x_n; \theta)$ ($X$ continuous), where $p_X(x; \theta)$ or $f_X(x; \theta)$ is called the likelihood function: the probability that the observed value $x$ arises when the parameter equals $\theta$
In many applications the observations $X_i$ are assumed independent, and then the likelihood function has the form $p_X(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} p_{X_i}(x_i; \theta)$ ($X$ discrete), $f_X(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta)$ ($X$ continuous)
It is often analytically or computationally convenient to maximize its logarithm, the log-likelihood function: $\ln p_X(x_1, \dots, x_n; \theta) = \sum_{i=1}^{n} \ln p_{X_i}(x_i; \theta)$ ($X$ discrete), $\ln f_X(x_1, \dots, x_n; \theta) = \sum_{i=1}^{n} \ln f_{X_i}(x_i; \theta)$ ($X$ continuous)
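A minimal sketch of log-likelihood maximization for i.i.d. Bernoulli($\theta$) observations (the data below are hypothetical); the ML estimate should come out as the sample frequency of ones:

```python
import math

# hypothetical Bernoulli sample; the ML estimate is the sample mean
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_lik(theta):
    """Log-likelihood sum_i ln p(x_i; theta) for Bernoulli(theta)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in xs)

# crude grid search over theta in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=log_lik)

assert abs(theta_hat - sum(xs) / len(xs)) < 1e-3
```

Setting the derivative of the log-likelihood to zero gives the same answer analytically: $\hat\theta_n = \frac{1}{n}\sum_i x_i$.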
ML versus MAP: in Bayesian MAP estimation, the estimate is chosen to maximize the expression pΘ(θ)pX∣Θ(x∣θ) over all θ
in the case of discrete θ, if we view pX(x;θ) as a conditional PMF, ML estimation can be interpreted as MAP estimation with a flat prior (a prior which is the same for all θ, indicating the absence of any useful prior knowledge)
in the case of continuous θ with a bounded range of possible values, ML estimation can be interpreted as MAP estimation with a uniform prior: fΘ(θ)=c for all θ and some constant c
Maximum Likelihood Estimation:
We are given the realization x=(x1,…,xn) of a random vector X=(X1,…,Xn), distributed according to a PMF pX(x;θ) or PDF fX(x;θ)
The ML estimate is a value of θ that maximizes the likelihood function, pX(x;θ) or fX(x;θ), over all θ
The invariance principle: the ML estimate of a one-to-one function h(θ) of θ is h(θ^n), where θ^n is the ML estimate of θ
When the r.v.'s Xi are i.i.d., and under some mild additional assumptions, each component of the ML estimator is consistent and asymptotically normal
9.1.3 Estimation of the Mean and Variance of a Random Variable
Let the observations X1,…,Xn be i.i.d., with mean θ and variance v that are unknown
the sample mean $M_n = \dfrac{X_1 + \cdots + X_n}{n}$ is an unbiased estimator of $\theta$, and its MSE is $\dfrac{v}{n}$
two variance estimators are $\bar S_n^2 = \dfrac{1}{n}\sum_{i=1}^{n}(X_i - M_n)^2$ and $\hat S_n^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - M_n)^2$
the estimator Sˉn2 coincides with the ML estimator if the Xi are normal; it is biased but asymptotically unbiased; the estimator S^n2 is unbiased; for large n, the two variance estimators essentially coincide
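The two variance estimators on a small hypothetical data set:

```python
# Sample mean, biased variance estimator (ML under normality), and the
# unbiased n/(n-1) rescaling.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mn = sum(xs) / n
s_bar2 = sum((x - mn)**2 for x in xs) / n        # biased
s_hat2 = sum((x - mn)**2 for x in xs) / (n - 1)  # unbiased

assert abs(mn - 5.0) < 1e-12
assert abs(s_bar2 - 4.0) < 1e-12
assert abs(s_hat2 - s_bar2 * n / (n - 1)) < 1e-12
```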
Example:
Bayesian MAP estimator of the common mean of a normal random variable:
Suppose that the observations $X_1, \dots, X_n$ are normal, i.i.d., with unknown common mean $\theta$ and known variance $v$, and that we use a Bayesian approach with a normal prior on $\theta$. For the case where the prior mean is zero, we arrive at the estimator $\hat\Theta_n = \dfrac{X_1 + \cdots + X_n}{n+1}$. This estimator is biased, because $b_\theta(\hat\Theta_n) = E_\theta[\hat\Theta_n] - \theta = \dfrac{n\theta}{n+1} - \theta = -\dfrac{\theta}{n+1}$, but it is asymptotically unbiased, because $\lim_{n\to\infty} b_\theta(\hat\Theta_n) = 0$
Its variance is $\mathrm{var}(\hat\Theta_n) = \dfrac{nv}{(n+1)^2}$, slightly smaller than the variance $\dfrac{v}{n}$ of the sample mean; $\mathrm{var}(\hat\Theta_n)$ does not depend on $\theta$, since the variance $v$ is known. The MSE is $E_\theta[\tilde\Theta_n^2] = b_\theta^2(\hat\Theta_n) + \mathrm{var}(\hat\Theta_n) = \dfrac{\theta^2}{(n+1)^2} + \dfrac{nv}{(n+1)^2}$
Classical ML estimator of the mean and variance of a normal random variable:
Estimate the mean $\mu$ and variance $v$ of a normal distribution using $n$ independent observations $X_1, \dots, X_n$. The parameter vector is $\theta = (\mu, v)$. The likelihood function is $f_X(x; \mu, v) = \prod_{i=1}^{n} f_{X_i}(x_i; \mu, v) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi v}} e^{-(x_i - \mu)^2/2v} = \dfrac{1}{(2\pi v)^{n/2}} \exp\Big\{-\dfrac{n s_n^2}{2v}\Big\} \exp\Big\{-\dfrac{n(m_n - \mu)^2}{2v}\Big\}$, where $m_n$ and $s_n^2$ are the realized values of $M_n = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar S_n^2 = \dfrac{1}{n}\sum_{i=1}^{n}(X_i - M_n)^2$. The log-likelihood function is $\ln f_X(x; \mu, v) = -\dfrac{n}{2}\ln 2\pi - \dfrac{n}{2}\ln v - \dfrac{n s_n^2}{2v} - \dfrac{n(m_n - \mu)^2}{2v}$. Setting the partial derivatives to zero, $\dfrac{\partial}{\partial\mu}\ln f_X(x;\mu,v) = -\dfrac{n}{2v}(2\mu - 2m_n) = 0$ and $\dfrac{\partial}{\partial v}\ln f_X(x;\mu,v) = -\dfrac{n}{2v^2}\big(v - s_n^2 - (m_n - \mu)^2\big) = 0$, gives $\mu = m_n$ and $v = s_n^2$, so $\hat\theta = (m_n, s_n^2)$ and $\hat\Theta = (M_n, \bar S_n^2)$. Using $E_{(\mu,v)}[M_n] = \mu$, $E_{(\mu,v)}[X_i^2] = v + \mu^2$, and $E_{(\mu,v)}[M_n^2] = \dfrac{v}{n} + \mu^2$, we get $E_{(\mu,v)}[\bar S_n^2] = \dfrac{1}{n}E_{(\mu,v)}\Big[\sum_{i=1}^{n} X_i^2 - 2M_n\sum_{i=1}^{n} X_i + nM_n^2\Big] = E_{(\mu,v)}\Big[\dfrac{1}{n}\sum_{i=1}^{n} X_i^2 - 2M_n^2 + M_n^2\Big] = v + \mu^2 - \dfrac{v}{n} - \mu^2 = \dfrac{n-1}{n}v$. Thus, $\bar S_n^2$ is not an unbiased estimator of $v$, although it is asymptotically unbiased; we can obtain an unbiased variance estimator after suitable scaling: $\hat S_n^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - M_n)^2 = \dfrac{n}{n-1}\bar S_n^2$
9.1.4 Confidence Intervals
A confidence interval (CI) for a scalar unknown parameter θ is an interval whose endpoints Θ^n− and Θ^n+ bracket θ with a given high probability
Θ^n− and Θ^n+ are random variables that depend on the observations X1,…,Xn
A $1-\alpha$ CI is one that satisfies $P_\theta(\hat\Theta_n^- \le \theta \le \hat\Theta_n^+) \ge 1 - \alpha$ for all possible values of $\theta$
CI for the estimation of the mean:
Suppose that the observations $X_i$ are i.i.d. normal, with unknown mean $\theta$ and known variance $v$. Then the sample mean estimator $\hat\Theta_n = \dfrac{X_1 + \cdots + X_n}{n}$ is normal (the sum of independent normal r.v.'s is normal), with mean $\theta$ and variance $\dfrac{v}{n}$. Let $\alpha = 0.05$; since $\Phi(1.96) = 0.975 = 1 - \alpha/2$, we have $P_\theta\Big(\dfrac{|\hat\Theta_n - \theta|}{\sqrt{v/n}} \le 1.96\Big) = 0.95$, i.e. $P_\theta\Big(\hat\Theta_n - 1.96\sqrt{\dfrac{v}{n}} \le \theta \le \hat\Theta_n + 1.96\sqrt{\dfrac{v}{n}}\Big) = 0.95$
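A minimal sketch of the known-variance 95% interval; the sample values and the assumed variance below are hypothetical:

```python
import math

xs = [4.1, 5.2, 4.9, 5.6, 4.7, 5.1, 5.3, 4.8]   # hypothetical sample
v = 0.25                                        # assumed known variance
n = len(xs)

est = sum(xs) / n                               # sample mean estimator
half = 1.96 * math.sqrt(v / n)                  # half-width at alpha = 0.05
ci = (est - half, est + half)

assert ci[0] < est < ci[1]
```

The interval width $2 \cdot 1.96\sqrt{v/n}$ shrinks like $1/\sqrt{n}$ as more observations arrive.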
9.1.5 Confidence Intervals Based on Estimator Variance Approximations
Suppose that the observations Xi are i.i.d. with mean θ and variance v that are unknown:
estimate $\theta$ with the sample mean $\hat\Theta_n = \dfrac{X_1 + \cdots + X_n}{n}$; estimate $v$ with the unbiased estimator $\hat S_n^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat\Theta_n)^2$; estimate the variance $\dfrac{v}{n}$ of the sample mean by $\dfrac{\hat S_n^2}{n}$. Then, for a given $\alpha$, we may use these estimates and the CLT to construct an approximate $1-\alpha$ confidence interval $\Big[\hat\Theta_n - z\dfrac{\hat S_n}{\sqrt{n}},\ \hat\Theta_n + z\dfrac{\hat S_n}{\sqrt{n}}\Big]$, where $z$ is obtained from the relation $\Phi(z) = 1 - \dfrac{\alpha}{2}$. Note that in this approach there are two different approximations in effect: first, $\hat\Theta_n$ is treated as a normal random variable; second, the true variance $\dfrac{v}{n}$ of $\hat\Theta_n$ is replaced by its estimate $\dfrac{\hat S_n^2}{n}$
Even in the special case where the $X_i$ are normal random variables, the CI produced by the preceding procedure is still approximate. The reason is that $\hat S_n^2$ is only an approximation to the true variance $v$, and the random variable $T_n = \dfrac{\hat\Theta_n - \theta}{\hat S_n/\sqrt{n}}$ is not normal. However, for normal $X_i$ it can be shown that the PDF of $T_n$ does not depend on $\theta$ and $v$; it is the t-distribution with $n-1$ degrees of freedom. Thus, when the $X_i$ are normal (or nearly normal) and $n$ is relatively small, a more accurate CI is $\Big[\hat\Theta_n - z\dfrac{\hat S_n}{\sqrt{n}},\ \hat\Theta_n + z\dfrac{\hat S_n}{\sqrt{n}}\Big]$, where $z$ is obtained from the relation $\Psi_{n-1}(z) = 1 - \dfrac{\alpha}{2}$
9.2 Linear Regression
First consider the case of only two variables. To model the relation between variables $x$ and $y$ based on a collection of data pairs $(x_i, y_i)$, $i = 1, \dots, n$, build a linear model of the form $y \approx \theta_0 + \theta_1 x$, where $\theta_0$ and $\theta_1$ are unknown parameters to be estimated. Given estimates $\hat\theta_0$ and $\hat\theta_1$ of the parameters, the value $y_i$ corresponding to $x_i$, as predicted by the model, is $\hat y_i = \hat\theta_0 + \hat\theta_1 x_i$. Generally, $\hat y_i$ will differ from the given value $y_i$, and the difference $\tilde y_i = y_i - \hat y_i$ is called the $i$th residual. The linear regression approach chooses the parameter estimates that minimize the sum of the squared residuals $\sum_{i=1}^{n}(y_i - \hat y_i)^2 = \sum_{i=1}^{n}(y_i - \theta_0 - \theta_1 x_i)^2$ over all $\theta_0$ and $\theta_1$: $\hat\theta_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}$, $\hat\theta_0 = \bar y - \hat\theta_1 \bar x$, where $\bar x = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar y = \dfrac{1}{n}\sum_{i=1}^{n} y_i$
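The closed-form estimates above, on a small hypothetical data set generated roughly along $y = 2x$:

```python
# Two-variable least-squares fit using the closed-form formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.1, 8.0, 9.9]          # hypothetical data, roughly y = 2x
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
theta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar)**2 for x in xs))
theta0 = ybar - theta1 * xbar

# slope close to 2, intercept close to 0 for this data
assert abs(theta1 - 2.0) < 0.1
assert abs(theta0) < 0.3
```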
9.2.1 Justification of the Least Squares Formulation
(a) Maximum likelihood (linear model, normal noise):
Assume that the $x_i$ are given numbers (not random variables), and that $y_i$ is the realization of a random variable $Y_i$ generated according to a model of the form $Y_i = \theta_0 + \theta_1 x_i + W_i$, $i = 1, \dots, n$, where the $W_i$ are i.i.d. normal random variables with mean zero and variance $\sigma^2$. It follows that the $Y_i$ are independent normal variables with mean $\theta_0 + \theta_1 x_i$ and variance $\sigma^2$. The likelihood function takes the form $f_Y(y; \theta) = \prod_{i=1}^{n} \dfrac{1}{\sigma\sqrt{2\pi}} \exp\Big\{-\dfrac{(y_i - \theta_0 - \theta_1 x_i)^2}{2\sigma^2}\Big\}$. Maximizing the likelihood function is the same as minimizing the sum of the squared residuals. Thus, the linear regression estimates can be viewed as ML estimates within a suitable linear/normal context. In fact, they can be shown to be unbiased estimates in this context.
(b) Approximate Bayesian LLMS estimation (under a possibly nonlinear model):
Suppose that both $x_i$ and $y_i$ are realizations of random variables $X_i$ and $Y_i$. The different pairs $(X_i, Y_i)$ are i.i.d., but with unknown joint distribution. Consider an additional independent pair $(X_0, Y_0)$ with the same joint distribution. Suppose we observe $X_0$ and wish to estimate $Y_0$ using a linear estimator of the form $\hat Y_0 = \theta_0 + \theta_1 X_0$. The LLMS estimator of $Y_0$ given $X_0$ is $\hat Y_0 = E[Y_0] + \dfrac{\mathrm{cov}(X_0, Y_0)}{\mathrm{var}(X_0)}(X_0 - E[X_0])$, yielding $\theta_1 = \dfrac{\mathrm{cov}(X_0, Y_0)}{\mathrm{var}(X_0)}$, $\theta_0 = E[Y_0] - \theta_1 E[X_0]$. Since the distribution of $(X_0, Y_0)$ is unknown, we use $\bar x$ as an estimate of $E[X_0]$, $\bar y$ as an estimate of $E[Y_0]$, $\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)$ as an estimate of $\mathrm{cov}(X_0, Y_0)$, and $\dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2$ as an estimate of $\mathrm{var}(X_0)$
by substituting these estimates into the above formulas for θ0 and θ1, we recover the expressions for the linear regression parameter estimates.
(c) Approximate Bayesian LMS estimation (linear model):
Let the pairs $(X_i, Y_i)$ be random and i.i.d. as in (b) above, and assume that they satisfy a linear model of the form $Y_i = \theta_0 + \theta_1 X_i + W_i$, where the $W_i$ are i.i.d. zero mean noise terms, independent of $X_i$. From the LMS property of conditional expectations, $E[Y_0 \mid X_0]$ minimizes the MSE $E[(Y_0 - g(X_0))^2]$ over all functions $g$. Under the given assumptions, $E[Y_0 \mid X_0] = \theta_0 + \theta_1 X_0$. Thus, the true parameters $\theta_0$ and $\theta_1$ minimize $E[(Y_0 - \theta_0' - \theta_1' X_0)^2]$ over all $\theta_0'$ and $\theta_1'$. By the WLLN, this expression is the limit as $n \to \infty$ of $\dfrac{1}{n}\sum_{i=1}^{n}(Y_i - \theta_0' - \theta_1' X_i)^2$. This indicates that we obtain a good approximation of the minimizers of $E[(Y_0 - \theta_0' - \theta_1' X_0)^2]$ (the true parameters) by minimizing the above expression (with $X_i$ and $Y_i$ replaced by their observed values $x_i$ and $y_i$), which is the same as minimizing the sum of the squared residuals.
9.2.2 Bayesian Linear Regression
Model:
(a) Assume a linear relation Yi=Θ0+Θ1xi+Wi
(b) The xi are modeled as known constants
(c) The random variables Θ0,Θ1,W1,…,Wn are normal and independent
(d) The random variables $\Theta_0$ and $\Theta_1$ have mean zero and variances $\sigma_0^2, \sigma_1^2$, respectively
(e) The random variables Wi have mean zero and variance σ2
Estimation Formulas:
Given the data pairs $(x_i, y_i)$, the MAP estimates of $\Theta_0$ and $\Theta_1$ are $\hat\theta_1 = \dfrac{\sigma_1^2 \sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sigma^2 + \sigma_1^2 \sum_{i=1}^{n}(x_i - \bar x)^2}$, $\hat\theta_0 = \dfrac{n\sigma_0^2}{\sigma^2 + n\sigma_0^2}(\bar y - \hat\theta_1 \bar x)$, where $\bar x = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar y = \dfrac{1}{n}\sum_{i=1}^{n} y_i$
If σ2 is very large compared to σ02 and σ12, we obtain θ0≈0 and θ1≈0. What is happening here is that the observations are too noisy and are essentially ignored, so that the estimates become the same as the prior means, which we assumed to be zero.
If we let the prior variances σ02 and σ12 increase to infinity, we are indicating the absence of any useful prior information on Θ0 and Θ1. In this case, the MAP estimates become independent of σ2, and they agree with the classical linear regression formulas.
Suppose, for simplicity, that xˉ=0. When estimating Θ1, the values yi of the observations Yi are weighted in proportion to the associated values xi. When xi is large, the contribution of Θ1xi to Yi is relatively large, and therefore Yi contains useful information on Θ1. Conversely, if xi is zero, the observation Yi is independent of Θ1 and can be ignored.
The estimates θ^0 and θ^1 are linear functions of the yi, but not of the xi. Recall that the xi are treated as exogenous, non-random quantities, whereas the yi are observed values of the random variables Yi. Thus the MAP estimators Θ^0 and Θ^1 are linear estimators. In view of normality assumptions, the estimators are also Bayesian LLMS estimators as well as LMS estimators.
9.2.3 Multiple Linear Regression
$y \approx \theta_0 + \sum_{j=1}^{m} \theta_j h_j(x)$, where the $h_j$ are functions that capture the general form of the anticipated dependence of $y$ on $x$. We may then obtain parameter estimates $\hat\theta_0, \hat\theta_1, \dots, \hat\theta_m$ by minimizing over $\theta_0, \theta_1, \dots, \theta_m$ the expression $\sum_{i=1}^{n}\Big(y_i - \theta_0 - \sum_{j=1}^{m}\theta_j h_j(x_i)\Big)^2$. The minimization problem is known to admit closed form as well as efficient numerical solutions.
9.2.4 Nonlinear Regression
$y \approx h(x; \theta)$, where $h$ is a given function and $\theta$ is a parameter to be estimated. Given data pairs $(x_i, y_i)$, $i = 1, \dots, n$, we seek a value of $\theta$ that minimizes the sum of the squared residuals $\sum_{i=1}^{n}\big(y_i - h(x_i; \theta)\big)^2$. Unlike linear regression, this minimization problem does not admit a closed form solution. As with linear regression, nonlinear least squares can be motivated as ML estimation with a model of the form $Y_i = h(x_i; \theta) + W_i$, where the $W_i$ are i.i.d. normal random variables with mean zero and variance $\sigma^2$. The likelihood function takes the form $f_Y(y; \theta) = \prod_{i=1}^{n} \dfrac{1}{\sigma\sqrt{2\pi}} \exp\Big\{-\dfrac{(y_i - h(x_i; \theta))^2}{2\sigma^2}\Big\}$. Maximizing the likelihood function is the same as minimizing the sum of the squared residuals.
9.2.5 Practical Considerations
Heteroscedasticity:
The motivation of linear regression as ML estimation in the presence of normal noise terms $W_i$ rests on the assumption that the variance of $W_i$ is the same for all $i$. Quite often, however, the variance of $W_i$ varies substantially over the data pairs; for example, it may be strongly affected by the value of $x_i$. In this case, a few noise terms with large variance may end up having an undue influence on the parameter estimates. An appropriate remedy is a weighted least squares criterion of the form $\sum_{i=1}^{n} \alpha_i (y_i - \theta_0 - \theta_1 x_i)^2$, where the weights $\alpha_i$ are smaller for those $i$ for which the variance of $W_i$ is large.
Nonlinearity:
Often a variable x can explain the values of a variable y, but the effect is nonlinear. A regression model based on data pairs of the form (h(xi),yi) may be more appropriate, with a suitably chosen function h.
Multicollinearity:
Suppose that we use two explanatory variables x and z in a model that predicts another variable y. If the two variables x and z bear a strong relation, the estimation procedure may be unable to distinguish reliably the relative effects of each explanatory variable.
Overfitting:
Multiple regression with a large number of explanatory variables and a corresponding large number of parameters to be estimated runs the danger of producing a model that fits the data well, but is otherwise useless. For example, suppose that a linear model is valid but we choose to fit 10 given data pairs with a polynomial of degree 9. The resulting polynomial model will provide a perfect fit to the data, but will nevertheless be incorrect. As a rule of thumb, there should be at least five times (preferably ten times) more data points than there are parameters to be estimated.
Causality:
The discovery of a linear relation between two variables x and y should not be mistaken for a discovery of a causal relation. A tight fit may be due to the fact that variable x has a causal effect on y, but may be equally due to a causal effect of y on x. Alternatively, there may be some external effect, described by yet another variable z, that affects both x and y in similar ways.
9.3 Binary Hypothesis Testing
Hypotheses:
H0: null hypothesis
H1: alternative hypothesis
Type of Errors:
Type I error (false rejection): $\alpha(R) = P(X \in R; H_0)$
Type II error (false acceptance): $\beta(R) = P(X \notin R; H_1)$
Analogy with Bayesian hypothesis testing involving two hypotheses $\Theta = \theta_0$ and $\Theta = \theta_1$:
The overall probability of error is minimized by using the MAP rule: given the observed value $x$ of $X$, declare $\Theta = \theta_1$ to be true if $p_\Theta(\theta_0)\, p_{X\mid\Theta}(x \mid \theta_0) < p_\Theta(\theta_1)\, p_{X\mid\Theta}(x \mid \theta_1)$
This decision rule can be rewritten as follows: define the likelihood ratio $L(x)$ by $L(x) = \dfrac{p_{X\mid\Theta}(x \mid \theta_1)}{p_{X\mid\Theta}(x \mid \theta_0)}$, and declare $\Theta = \theta_1$ to be true if the realized value $x$ of the observation vector $X$ satisfies $L(x) > \xi$, where the critical value is $\xi = \dfrac{p_\Theta(\theta_0)}{p_\Theta(\theta_1)}$
Motivated by the preceding form of the MAP rule, consider rejection regions of the form $R = \{x \mid L(x) > \xi\}$, where the likelihood ratio $L(x)$ is defined as in the Bayesian case: $L(x) = \dfrac{p_X(x; H_1)}{p_X(x; H_0)}$, or $L(x) = \dfrac{f_X(x; H_1)}{f_X(x; H_0)}$. The critical value $\xi$ remains free to be chosen on the basis of other considerations. The special case $\xi = 1$ corresponds to the ML rule.
Choosing ξ trades off the probabilities of the two types of errors. As ξ increases, the rejection region becomes smaller. As a result, the false rejection probability α(R) decreases, while the false acceptance probability β(R) increases. Because of this trade-off, there is no single best way of choosing the critical value.
The most popular approach is Likelihood Ratio Test (LRT)
Start with a target value α for the false rejection probability
Choose a value for ξ such that the false rejection probability is equal to α: P(L(X)>ξ;H0)=α
Once the value x of X is observed, reject H0 if L(x)>ξ
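A sketch of the LRT in a simple normal setting (all numbers hypothetical): $X \sim N(0,1)$ under $H_0$ and $X \sim N(1,1)$ under $H_1$, so $L(x) = e^{x - 1/2}$ is monotone in $x$ and the test $L(x) > \xi$ reduces to a threshold on $x$ itself:

```python
import math

def normal_sf(c):
    """P(X > c) for a standard normal X."""
    return 0.5 * math.erfc(c / math.sqrt(2))

alpha = 0.05
c = 1.6449                              # z with P(X > z; H0) approx 0.05
xi = math.exp(c - 0.5)                  # equivalent critical value for L(X)
assert abs(normal_sf(c) - alpha) < 1e-3

x = 2.1                                 # hypothetical observed value
L = math.exp(x - 0.5)                   # likelihood ratio at x
assert (L > xi) == (x > c)              # the two rejection rules agree
```

Because $L$ is monotone here, fixing $P(L(X) > \xi; H_0) = \alpha$ is the same as fixing $P(X > c; H_0) = \alpha$.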
Typical choices are $\alpha = 0.1$, $\alpha = 0.05$, or $\alpha = 0.01$, depending on the degree of undesirability of false rejection. To apply the LRT to a given problem, the following are required:
(a) We must be able to compute $L(x)$ for any given observation value $x$
(b) We must either have a closed form expression for the distribution of L(X) or of a related random variable such as lnL(X) or we must be able to approximate it analytically, computationally or through simulation; this is needed to determine the critical value ξ that corresponds to a given false rejection probability α
When L(X) is a continuous random variable, the probability P(L(X)>ξ;H0) moves continuously from 1 to 0 as ξ increases. Thus, we can find a value of ξ for which the requirement P(L(X)>ξ;H0)=α is satisfied. However, if L(X) is a discrete random variable, it may be impossible to satisfy the equality P(L(X)>ξ;H0)=α exactly. In such cases, there are several possibilities:
(a) Strive for approximate equality
(b) Choose the smallest value of ξ that satisfies P(L(X)>ξ;H0)≤α
(c) Use an exogenous source of randomness to choose between two alternative candidate critical values. (randomized likelihood ratio test)
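Option (b) above can be sketched for a discrete case (made-up numbers): $X$ = number of heads in $n=10$ tosses, $H_0: p=0.5$ vs $H_1: p=0.7$. Since $L(x)$ is increasing in $x$, thresholding $L(X)$ is the same as thresholding $X$, and we pick the smallest rejection region $\{X \ge k\}$ whose false rejection probability does not exceed α.

```python
from math import comb

# Assumed discrete setup: X = # heads in n = 10 tosses,
# H0: p = 0.5.  L(x) is increasing in x, so the LRT rejection
# region has the form {X >= k}.
n, p0, alpha = 10, 0.5, 0.05

def pmf0(x):
    # PMF of X under H0: Binomial(n, p0)
    return comb(n, x) * p0**x * (1 - p0)**(n - x)

# Option (b): smallest region {X >= k} with P(X >= k; H0) <= alpha.
for k in range(n + 1):
    tail = sum(pmf0(x) for x in range(k, n + 1))
    if tail <= alpha:
        break
print(k, round(tail, 4))
```

Note that exact equality to α is impossible here: $P(X\ge 8;H_0)=56/1024\approx 0.0547$ overshoots, while $P(X\ge 9;H_0)=11/1024\approx 0.0107$ undershoots, which is precisely the situation options (a)–(c) address.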
For a given false rejection probability, the LRT offers the smallest possible false acceptance probability:
Neyman-Pearson Lemma
Consider a particular choice of ξ in the LRT, which results in error probabilities $$P(L(X)>\xi;H_0)=\alpha, \qquad P(L(X)\le\xi;H_1)=\beta.$$ Suppose that some other test, with rejection region R, achieves a smaller or equal false rejection probability: $P(X\in R;H_0)\le\alpha$. Then, $$P(X\notin R;H_1)\ge\beta,$$ with strict inequality $P(X\notin R;H_1)>\beta$ when $P(X\in R;H_0)<\alpha$.
The Neyman-Pearson Lemma states that all pairs (α(ξ),β(ξ)) corresponding to LRTs lie on the efficient frontier
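A small numerical illustration of the lemma (assumed model, one observation): $H_0: X \sim N(0,1)$ vs $H_1: X \sim N(2,1)$. The LRT rejects for large $x$; a competing two-sided test with the same α must have a larger false acceptance probability.

```python
from statistics import NormalDist

# Assumed model: H0: X ~ N(0,1), H1: X ~ N(2,1), one observation.
H0, H1 = NormalDist(0, 1), NormalDist(2, 1)
alpha = 0.05

# LRT: L(x) is increasing in x, so reject when x > t with
# P(X > t; H0) = alpha.
t = H0.inv_cdf(1 - alpha)
beta_lrt = H1.cdf(t)                  # false acceptance P(X <= t; H1)

# Competing test with the same alpha: reject when |x| > c,
# where P(|X| > c; H0) = alpha  =>  c = z_{1 - alpha/2}.
c = H0.inv_cdf(1 - alpha / 2)
beta_alt = H1.cdf(c) - H1.cdf(-c)     # false acceptance P(|X| <= c; H1)

print(round(beta_lrt, 4), round(beta_alt, 4))
```

Both tests have false rejection probability 0.05, but the LRT's β is strictly smaller, consistent with the lemma.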
9.4 Significance Testing
Restrict the scope of discussion to situations with the following characteristics:
(a) Parametric models: Assume that the observations X1,…,Xn have a distribution governed by a joint PMF/PDF, which is completely determined by an unknown parameter θ (scalar or vector), belonging to a given set M of possible parameters
(b) Simple null hypothesis: The null hypothesis asserts that the true value of θ is equal to a given element θ0 of M
(c) Alternative hypothesis: The alternative hypothesis H1 is just the statement that H0 is not true, i.e., that θ≠θ0
9.4.1 The General Approach
Significance Testing Methodology:
A statistical test of a hypothesis "H0:θ=θ∗" is to be performed, based on the observations X1,…,Xn
Before the data are observed:
(a) Choose a statistic S, that is, a scalar random variable that will summarize the data to be obtained. Mathematically, this involves the choice of a function h: Rn→R, resulting in the statistic S=h(X1,…,Xn)
(b) Determine the shape of the rejection region by specifying the set of values of S for which H0 will be rejected as a function of a yet undetermined critical value ξ
(c) Choose the significance level, i.e., the desired probability α of a false rejection of H0
(d) Choose the critical value ξ so that the probability of false rejection is equal (or approximately equal) to α; at this point, the rejection region is completely determined
Once the values x1,…,xn of X1,…,Xn are observed:
(i) Calculate the value s=h(x1,…,xn) of the statistic S
(ii) Reject the hypothesis H0 if s belongs to the rejection region
Quite often, statisticians skip steps (c) and (d): they calculate the realized value s of S, then determine and report an associated p-value $$\text{p-value} = \min\{\alpha \mid H_0 \text{ would be rejected at the } \alpha \text{ significance level}\}$$
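As a sketch of the p-value computation, take the classic coin-fairness test (the counts below are made-up data): n tosses, statistic $S = |\text{heads} - n/2|$, with larger S counting as stronger evidence against $H_0$. The p-value is the probability, under $H_0$, of a value of S at least as extreme as the one observed.

```python
from math import comb

# Hypothetical data: testing H0 "the coin is fair" from n = 100 tosses.
n, heads = 100, 60
s = abs(heads - n / 2)      # realized value of the statistic S

# p-value = P(S >= s; H0), with #heads ~ Binomial(n, 1/2) under H0.
p_value = sum(comb(n, k) for k in range(n + 1)
              if abs(k - n / 2) >= s) / 2**n
print(round(p_value, 4))
```

Here the p-value comes out just above 0.05, so $H_0$ would be rejected at the 0.1 level but not at the 0.05 level, which is exactly the information the p-value summarizes.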
9.4.2 Generalized Likelihood Ratio and Goodness of Fit Tests
Generalized likelihood ratio test: "is there a model compatible with H1 that provides a better explanation for the observed data than that provided by the model corresponding to H0"
(a) Estimate a model by ML, i.e., determine a parameter vector θ^=(θ^1,…,θ^m) that maximizes the likelihood function pX(x;θ) over all vectors θ
(b) Carry out an LRT that compares the likelihood $p_X(x;\theta^*)$ under H0 to the likelihood $p_X(x;\hat\theta)$ corresponding to the estimated model; form the generalized likelihood ratio $$\frac{p_X(x;\hat\theta)}{p_X(x;\theta^*)}$$ and, if it exceeds a critical value ξ, reject H0. As in binary hypothesis testing, we choose ξ so that the probability of false rejection is (approximately) equal to a given significance level α
Goodness of fit test: test whether a given PMF conforms with observed data
Consider a random variable that takes values in the finite set {1,…,m}, and let θk be the probability of outcome k. Thus, the distribution (PMF) of this random variable is described by the vector parameter θ=(θ1,…,θm). Consider the hypotheses $$H_0:\theta=(\theta_1^*,\ldots,\theta_m^*), \qquad H_1:\theta\neq(\theta_1^*,\ldots,\theta_m^*),$$ where the θk* are given nonnegative numbers that sum to 1. We draw n independent samples of the random variable and let Nk be the number of samples that result in outcome k. Thus, our observation is X=(N1,…,Nm) and we denote its realized value by x=(n1,…,nm), with $$N_1+\cdots+N_m=n_1+\cdots+n_m=n.$$
ML estimation involves a maximization over the set of probability distributions (θ1,…,θm). The PMF of the observation vector X is multinomial, so the likelihood function is $$p_X(x;\theta) = c\,\theta_1^{n_1}\cdots\theta_m^{n_m}, \qquad c = \binom{n}{n_1,\ldots,n_m},$$ and the log-likelihood is $$\ln p_X(x;\theta) = \ln c + n_1\ln\theta_1 + \cdots + n_{m-1}\ln\theta_{m-1} + n_m\ln(1-\theta_1-\cdots-\theta_{m-1}).$$ Assuming that the vector θ̂ that maximizes the log-likelihood has positive components, it can be found by setting the derivatives with respect to θ1,…,θm−1 to zero: $$\frac{\partial \ln p_X(x;\theta)}{\partial \theta_1} = \frac{n_1}{\theta_1} + \frac{n_m}{1-\theta_1-\cdots-\theta_{m-1}}\cdot(-1) = 0, \qquad \text{and similarly} \qquad \frac{n_k}{\hat\theta_k} = \frac{n_m}{1-\hat\theta_1-\cdots-\hat\theta_{m-1}}, \quad k=1,\ldots,m-1.$$ Since the term on the right-hand side is equal to $n_m/\hat\theta_m$, all ratios $n_k/\hat\theta_k$ must be equal. Using the fact $n_1+\cdots+n_m=n$, $$\hat\theta_k = \frac{n_k}{n}, \qquad k=1,\ldots,m.$$
The generalized likelihood ratio test $$\text{rejects } H_0 \text{ if } \quad \frac{p_X(x;\hat\theta)}{p_X(x;\theta^*)} = \prod_{k=1}^m \frac{(n_k/n)^{n_k}}{(\theta_k^*)^{n_k}} > \xi.$$ By taking logarithms, the test reduces to $$\text{rejecting } H_0 \text{ if } \quad \sum_{k=1}^m n_k \ln\frac{n_k}{n\theta_k^*} > \ln\xi.$$ We need to determine ξ by taking into account the required significance level, that is, $$P(S>\ln\xi;H_0)=\alpha, \qquad \text{where} \quad S=\sum_{k=1}^m N_k\ln\frac{N_k}{n\theta_k^*}.$$ The distribution of S under H0 is not readily available and can only be simulated
Major simplifications are possible when n is large. In this case, under H0 the observed frequencies θ̂k = nk/n will be close to θk* with high probability. A second-order Taylor series expansion then shows that the statistic S is well approximated by T/2, where $$T = \sum_{k=1}^m \frac{(N_k - n\theta_k^*)^2}{n\theta_k^*}.$$ When n is large, it is known that under the hypothesis H0 the distribution of T (and of 2S) approaches a χ² distribution with m−1 degrees of freedom. Thus, approximately correct critical values for the requirement P(2S>γ;H0)=α can be obtained from the χ² tables.
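The simulation route mentioned above can be sketched directly: simulate the null distribution of T by resampling multinomials under H0 and compare with the observed value. The die probabilities and counts below are made-up data for illustration.

```python
import random

# Assumed example: H0 says a six-sided die is fair.
# The counts are invented data with n = 100 rolls.
random.seed(0)
theta_star = [1/6] * 6                     # H0: theta_k* = 1/6 for all k
counts = [21, 14, 16, 20, 12, 17]          # observed N_k (made up)
n = sum(counts)

def T_stat(nk):
    # T = sum_k (N_k - n*theta_k*)^2 / (n*theta_k*)
    return sum((nk[k] - n * theta_star[k])**2 / (n * theta_star[k])
               for k in range(6))

t_obs = T_stat(counts)

def sample_counts():
    # Draw n outcomes under H0 and tally them
    nk = [0] * 6
    for _ in range(n):
        nk[random.randrange(6)] += 1
    return nk

# Monte Carlo estimate of the null distribution of T
sims = [T_stat(sample_counts()) for _ in range(2000)]
p_value = sum(t >= t_obs for t in sims) / len(sims)
print(round(t_obs, 2), round(p_value, 3))
```

For large n one would instead compare t_obs against a χ² critical value with m−1 = 5 degrees of freedom; the Monte Carlo p-value here lands near the χ²(5) tail probability of T = 3.56, as the approximation predicts.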